Posted to commits@tvm.apache.org by tq...@apache.org on 2022/06/11 21:10:19 UTC
[tvm-site] branch asf-site updated: deploying docs (apache/tvm@8f6543e9e6173cd45b678e91b5a637ff7f8e0e02)
This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 9866c2936 deploying docs (apache/tvm@8f6543e9e6173cd45b678e91b5a637ff7f8e0e02)
9866c2936 is described below
commit 9866c2936171ea4ebb684abdcc387bfbc62cc46e
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Sat Jun 11 21:10:13 2022 +0000
deploying docs (apache/tvm@8f6543e9e6173cd45b678e91b5a637ff7f8e0e02)
---
.../how_to/compile_models/from_mxnet.rst.txt | 2 +-
.../how_to/compile_models/from_oneflow.rst.txt | 2 +-
.../how_to/compile_models/from_paddle.rst.txt | 2 +-
.../how_to/compile_models/from_pytorch.rst.txt | 2 +-
.../how_to/compile_models/from_tensorflow.rst.txt | 2 +-
.../compile_models/sg_execution_times.rst.txt | 22 +-
.../deploy_models/deploy_model_on_android.rst.txt | 2 +-
.../deploy_object_detection_pytorch.rst.txt | 4 +-
.../deploy_models/deploy_prequantized.rst.txt | 6 +-
.../deploy_prequantized_tflite.rst.txt | 4 +-
.../how_to/deploy_models/deploy_quantized.rst.txt | 2 +-
.../deploy_models/deploy_ssd_gluoncv.rst.txt | 4 +-
.../deploy_models/sg_execution_times.rst.txt | 18 +-
.../extend_tvm/bring_your_own_datatypes.rst.txt | 2 +-
.../how_to/extend_tvm/sg_execution_times.rst.txt | 8 +-
.../how_to/extend_tvm/use_pass_instrument.rst.txt | 16 +-
.../optimize_operators/opt_conv_cuda.rst.txt | 2 +-
.../optimize_operators/opt_conv_tensorcore.rst.txt | 2 +-
.../how_to/optimize_operators/opt_gemm.rst.txt | 16 +-
.../optimize_operators/sg_execution_times.rst.txt | 8 +-
.../sg_execution_times.rst.txt | 16 +-
.../tune_conv2d_layer_cuda.rst.txt | 2214 ++++++++++++--------
.../tune_network_cuda.rst.txt | 2 +-
.../tune_network_x86.rst.txt | 4 +-
.../tune_sparse_x86.rst.txt | 86 +-
.../tune_with_autotvm/sg_execution_times.rst.txt | 10 +-
.../tune_with_autotvm/tune_conv2d_cuda.rst.txt | 34 +-
.../work_with_microtvm/micro_autotune.rst.txt | 16 +-
.../how_to/work_with_microtvm/micro_train.rst.txt | 12 +-
.../work_with_microtvm/sg_execution_times.rst.txt | 16 +-
.../work_with_relay/sg_execution_times.rst.txt | 8 +-
.../work_with_schedules/sg_execution_times.rst.txt | 14 +-
.../how_to/work_with_schedules/tensorize.rst.txt | 2 +-
.../tutorials/autotvm/sg_execution_times.rst.txt | 6 +-
.../frontend/deploy_classification.rst.txt | 2 +-
.../tutorials/frontend/deploy_detection.rst.txt | 2 +-
.../tutorials/frontend/sg_execution_times.rst.txt | 6 +-
.../tutorials/optimize/sg_execution_times.rst.txt | 6 +-
.../topic/vta/tutorials/sg_execution_times.rst.txt | 6 +-
.../tutorial/auto_scheduler_matmul_x86.rst.txt | 7 +-
docs/_sources/tutorial/autotvm_relay_x86.rst.txt | 56 +-
.../tutorial/cross_compilation_and_rpc.rst.txt | 2 +-
docs/_sources/tutorial/intro_topi.rst.txt | 2 +-
docs/_sources/tutorial/sg_execution_times.rst.txt | 26 +-
.../tutorial/tensor_expr_get_started.rst.txt | 45 +-
docs/commit_hash | 2 +-
docs/how_to/compile_models/from_mxnet.html | 2 +-
docs/how_to/compile_models/from_oneflow.html | 98 +-
docs/how_to/compile_models/from_paddle.html | 2 +-
docs/how_to/compile_models/from_pytorch.html | 21 +-
docs/how_to/compile_models/from_tensorflow.html | 2 +-
docs/how_to/compile_models/sg_execution_times.html | 22 +-
.../deploy_models/deploy_model_on_android.html | 2 +-
.../deploy_object_detection_pytorch.html | 19 +-
docs/how_to/deploy_models/deploy_prequantized.html | 11 +-
.../deploy_models/deploy_prequantized_tflite.html | 4 +-
docs/how_to/deploy_models/deploy_quantized.html | 2 +-
docs/how_to/deploy_models/deploy_ssd_gluoncv.html | 34 +-
docs/how_to/deploy_models/sg_execution_times.html | 18 +-
.../extend_tvm/bring_your_own_datatypes.html | 2 +-
docs/how_to/extend_tvm/sg_execution_times.html | 8 +-
docs/how_to/extend_tvm/use_pass_instrument.html | 16 +-
docs/how_to/optimize_operators/opt_conv_cuda.html | 2 +-
.../optimize_operators/opt_conv_tensorcore.html | 2 +-
docs/how_to/optimize_operators/opt_gemm.html | 16 +-
.../optimize_operators/sg_execution_times.html | 8 +-
.../sg_execution_times.html | 14 +-
.../tune_conv2d_layer_cuda.html | 2214 ++++++++++++--------
.../tune_with_autoscheduler/tune_network_cuda.html | 2 +-
.../tune_with_autoscheduler/tune_network_x86.html | 4 +-
.../tune_with_autoscheduler/tune_sparse_x86.html | 86 +-
.../tune_with_autotvm/sg_execution_times.html | 10 +-
.../how_to/tune_with_autotvm/tune_conv2d_cuda.html | 34 +-
docs/how_to/work_with_microtvm/micro_autotune.html | 16 +-
docs/how_to/work_with_microtvm/micro_train.html | 12 +-
.../work_with_microtvm/sg_execution_times.html | 14 +-
.../how_to/work_with_relay/sg_execution_times.html | 8 +-
.../work_with_schedules/sg_execution_times.html | 14 +-
docs/how_to/work_with_schedules/tensorize.html | 2 +-
.../classtvm_1_1BaseExpr__inherit__graph.svg | 22 +-
.../classtvm_1_1RelayExpr__inherit__graph.svg | 22 +-
.../classtvm_1_1relay_1_1Constant-members.html | 11 +-
.../api/doxygen/classtvm_1_1relay_1_1Constant.html | 24 +-
.../classtvm_1_1relay_1_1Constant__coll__graph.svg | 138 +-
...asstvm_1_1relay_1_1Constant__inherit__graph.svg | 108 +-
docs/reference/api/doxygen/functions_func_t.html | 7 +-
docs/reference/api/doxygen/functions_t.html | 7 +-
.../api/doxygen/interpreter_8h_source.html | 2 +-
docs/reference/api/doxygen/ir_2expr_8h_source.html | 2 +-
.../api/doxygen/namespacemembers_func_w.html | 3 +-
docs/reference/api/doxygen/namespacemembers_w.html | 3 +-
docs/reference/api/doxygen/namespacetvm.html | 535 ++---
.../api/doxygen/namespacetvm_1_1relay.html | 250 +--
.../api/doxygen/pattern__functor_8h_source.html | 2 +-
docs/reference/api/doxygen/relay_2adt_8h.html | 4 +-
.../api/doxygen/relay_2adt_8h_source.html | 20 +-
.../api/doxygen/relay_2analysis_8h_source.html | 2 +-
.../doxygen/relay_2attrs_2memory_8h_source.html | 4 +-
docs/reference/api/doxygen/relay_2expr_8h.html | 24 +-
.../api/doxygen/relay_2expr_8h_source.html | 171 +-
.../api/doxygen/relay_2expr__functor_8h.html | 2 +-
.../doxygen/relay_2expr__functor_8h_source.html | 100 +-
docs/reference/api/doxygen/relay_2function_8h.html | 2 +-
.../api/doxygen/relay_2function_8h_source.html | 24 +-
.../doxygen/relay_2op__attr__types_8h_source.html | 4 +-
.../api/doxygen/relay_2transform_8h_source.html | 4 +-
.../api/doxygen/runtime_2memory_8h_source.html | 2 +-
docs/reference/api/doxygen/search/all_15.js | 2 +-
docs/reference/api/doxygen/search/all_18.js | 2 +-
docs/reference/api/doxygen/search/functions_14.js | 2 +-
docs/reference/api/doxygen/search/functions_17.js | 2 +-
...m_1_1relay_1_1AllocTensorAttrs__coll__graph.svg | 26 +-
docs/reference/api/python/auto_scheduler.html | 4 +-
.../api/typedoc/classes/bytestreamreader.html | 12 +-
.../api/typedoc/classes/cachedcallstack.html | 34 +-
docs/reference/api/typedoc/classes/dldatatype.html | 12 +-
docs/reference/api/typedoc/classes/dldevice.html | 10 +-
.../reference/api/typedoc/classes/environment.html | 12 +-
docs/reference/api/typedoc/classes/ffilibrary.html | 20 +-
.../api/typedoc/classes/graphexecutor.html | 16 +-
docs/reference/api/typedoc/classes/instance.html | 40 +-
docs/reference/api/typedoc/classes/memory.html | 34 +-
docs/reference/api/typedoc/classes/module.html | 10 +-
docs/reference/api/typedoc/classes/ndarray.html | 22 +-
.../api/typedoc/classes/packedfunccell.html | 6 +-
docs/reference/api/typedoc/classes/rpcserver.html | 14 +-
docs/reference/api/typedoc/classes/scalar.html | 6 +-
.../api/typedoc/classes/webgpucontext.html | 12 +-
docs/reference/api/typedoc/enums/argtypecode.html | 30 +-
.../api/typedoc/enums/aynccallbackcode.html | 4 +-
.../api/typedoc/enums/dldatatypecode.html | 8 +-
.../api/typedoc/enums/rpcserverstate.html | 12 +-
docs/reference/api/typedoc/enums/sizeof.html | 18 +-
docs/reference/api/typedoc/index.html | 112 +-
.../api/typedoc/interfaces/disposable.html | 2 +-
.../api/typedoc/interfaces/functioninfo.html | 6 +-
.../api/typedoc/interfaces/libraryprovider.html | 4 +-
docs/searchindex.js | 2 +-
.../vta/tutorials/autotvm/sg_execution_times.html | 6 +-
.../tutorials/frontend/deploy_classification.html | 2 +-
.../vta/tutorials/frontend/deploy_detection.html | 2 +-
.../vta/tutorials/frontend/sg_execution_times.html | 6 +-
.../vta/tutorials/optimize/sg_execution_times.html | 6 +-
docs/topic/vta/tutorials/sg_execution_times.html | 6 +-
docs/tutorial/auto_scheduler_matmul_x86.html | 3 +-
docs/tutorial/autotvm_relay_x86.html | 266 +--
docs/tutorial/cross_compilation_and_rpc.html | 2 +-
docs/tutorial/intro_topi.html | 2 +-
docs/tutorial/sg_execution_times.html | 26 +-
docs/tutorial/tensor_expr_get_started.html | 41 +-
150 files changed, 4340 insertions(+), 3472 deletions(-)
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index 240fc2b28..68f02b586 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -98,7 +98,7 @@ In this section, we download a pretrained imagenet model and classify an image.
.. code-block:: none
- Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip4c98474d-b9a5-4b38-b0e2-b9ba0c921c4d from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+ Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip3aa5fed0-f0aa-4b27-b09c-8203092b35b5 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
x (1, 3, 224, 224)
diff --git a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
index 2ce63faad..233e3614a 100644
--- a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
@@ -100,7 +100,7 @@ Load a pretrained OneFlow model and save model
.. code-block:: none
Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
-
0%| | 0.00/41.5M [00:00<?, ?B/s]
0%| | 16.0k/41.5M [00:00<08:11, 88.4kB/s]
0%| | 48.0k/41.5M [00:00<05:10, 140kB/s]
0%| | 72.0k/41.5M [00:00<05:18, 136kB/s]
0%| | 160k/41.5M [00:00<02:38, 273kB/s]
1%| | 328k/41.5M [00:00<01:25, 508kB/s]
1%|1 | 512k/41.5M [00:01<01:03, 679kB/s]
2%|1 | 704k/41.5M [00:01<00:53, 802kB/s]
2%|2 | 904k/41.5M [00:01<00:47, 896kB/s]
3%|2 | 1.09M/41.5M [00:01<00:42, 987kB/s]
3%|3 | 1.31M/41.5M [00:01<00:39, 1.06MB/s]
4%|3 | 1.54M/41.5M [00:02<00:37, 1.13MB/s]
4%|4 | 1.78M/41.5M [00:02<00:34, 1.20MB/s]
5%|4 | 2.03M/41.5M [00:02<00:32, 1.26MB/s]
6%|5 | 2.30M/41.5M [00:02<00:30, 1.33MB/s]
6%|6 | 2.57M/41.5M [00:02<00:29, 1.39MB/s]
7%|6 | 2.87M/41.5M [00:02<00:27, 1.48MB/s]
8%|7 | 3.17M/41.5M [00:03<00:25, 1.55MB/s]
8%|8 | 3.49M/41.5M [00:03<00:24, 1.63MB/s]
9%|9 | 3.83M/41.5M [00:03<00:23, 1.70MB/s]
10%|# | 4.18M/41.5M [00:03<00:21, 1.79MB/s]
11%|# | 4.55M/41.5M [00:03<00:20, 1.87MB/s]
12%|#1 | 4.93M/41.5M [00:04<00:19, 1.96MB/s]
13%|#2 | 5.34M/41.5M [00:04<00:18, 2.05MB/s]
14%|#3 | 5.76M/41.5M [00:04<00:17, 2.15MB/s]
15%|#4 | 6.20M/41.5M [00:04<00:16, 2.26MB/s]
16%|#6 | 6.67M/41.5M [00:04<00:15, 2.37MB/s]
17%|#7 | 7.16M/41.5M [00:05<00:14, 2.49MB/s]
19%|#8 | 7.68M/41.5M [00:05<00:13, 2.61MB/s]
20%|#9 | 8.22M/41.5M [00:05<00:12, 2.74MB/s]
21%|##1 | 8.79M/41.5M [00:05<00:11, 2.88MB/s]
23%|##2 | 9.39M/41.5M [00:05<00:11, 3.03MB/s]
24%|##4 | 10.0M/41.5M [00:05<00:10, 3.18MB/s]
26%|##5 | 10.7M/41.5M [00:06<00:09, 3.33MB/s]
27%|##7 | 11.4M/41.5M [00:06<00:09, 3.49MB/s]
29%|##9 | 12.1M/41.5M [00:06<00:08, 3.67MB/s]
31%|### | 12.8M/41.5M [00:06<00:07, 3.85MB/s]
33%|###2 | 13.6M/41.5M [00:06<00:07, 4.04MB/s]
35%|###4 | 14.5M/41.5M [00:07<00:06, 4.24MB/s]
37%|###7 | 15.4M/41.5M [00:07<00:06, 4.44MB/s]
39%|###9 | 16.3M/41.5M [00:07<00:05, 4.66MB/s]
42%|####1 | 17.2M/41.5M [00:07<00:05, 4.88MB/s]
44%|####3 | 18.2M/41.5M [00:07<00:04, 5.13MB/s]
47%|####6 | 19.3M/41.5M [00:07<00:04, 5.63MB/s]
49%|####9 | 20.4M/41.5M [00:08<00:03, 5.83MB/s]
52%|#####2 | 21.6M/41.5M [00:08<00:03, 6.08MB/s]
55%|#####5 | 22.8M/41.5M [00:08<00:03, 6.33MB/s]
58%|#####8 | 24.1M/41.5M [00:08<00:02, 6.61MB/s]
61%|######1 | 25.5M/41.5M [00:08<00:02, 6.62MB/s]
65%|######4 | 26.9M/41.5M [00:09<00:02, 7.04MB/s]
68%|######8 | 28.4M/41.5M [00:09<00:01, 7.41MB/s]
72%|#######1 | 29.9M/41.5M [00:09<00:01, 7.66MB/s]
76%|#######5 | 31.3M/41.5M [00:09<00:01, 7.85MB/s]
79%|#######9 | 32.8M/41.5M [00:09<00:01, 7.99MB/s]
83%|########2 | 34.3M/41.5M [00:10<00:00, 8.06MB/s]
86%|########6 | 35.7M/41.5M [00:10<00:00, 8.12MB/s]
90%|########9 | 37.2M/41.5M [00:10<00:00, 8.15MB/s]
93%|#########3| 38.7M/41.5M [00:10<00:00, 8.20MB/s]
97%|#########6| 40.1M/41.5M [00:10<00:00, 8.22MB/s]
100%|##########| 41.5M/41.5M [00:10<00:00, 4.02MB/s]
+
0%| | 0.00/41.5M [00:00<?, ?B/s]
0%| | 16.0k/41.5M [00:00<07:25, 97.6kB/s]
0%| | 48.0k/41.5M [00:00<04:41, 154kB/s]
0%| | 72.0k/41.5M [00:00<04:48, 150kB/s]
0%| | 136k/41.5M [00:00<02:57, 244kB/s]
1%| | 288k/41.5M [00:00<01:28, 489kB/s]
1%|1 | 592k/41.5M [00:01<00:45, 950kB/s]
3%|2 | 1.17M/41.5M [00:01<00:22, 1.84MB/s]
6%|5 | 2.35M/41.5M [00:01<00:11, 3.59MB/s]
9%|9 | 3.82M/41.5M [00:01<00:07, 5.32MB/s]
13%|#2 | 5.29M/41.5M [00:01<00:05, 6.49MB/s]
16%|#6 | 6.77M/41.5M [00:01<00:04, 7.31MB/s]
20%|#9 | 8.23M/41.5M [00:02<00:04, 7.86MB/s]
23%|##3 | 9.70M/41.5M [00:02<00:03, 8.80MB/s]
27%|##6 | 11.1M/41.5M [00:02<00:03, 9.53MB/s]
30%|### | 12.6M/41.5M [00:02<00:02, 10.8MB/s]
33%|###2 | 13.7M/41.5M [00:02<00:02, 9.76MB/s]
35%|###5 | 14.7M/41.5M [00:02<00:03, 8.57MB/s]
38%|###7 | 15.6M/41.5M [00:02<00:03, 8.23MB/s]
41%|####1 | 17.0M/41.5M [00:02<00:02, 9.71MB/s]
43%|####3 | 18.0M/41.5M [00:03<00:02, 9.74MB/s]
46%|####5 | 19.0M/41.5M [00:03<00:02, 8.42MB/s]
48%|####8 | 20.0M/41.5M [00:03<00:02, 7.72MB/s]
52%|#####1 | 21.4M/41.5M [00:03<00:02, 8.80MB/s]
55%|#####5 | 22.9M/41.5M [00:03<00:01, 10.3MB/s]
58%|#####7 | 23.9M/41.5M [00:03<00:01, 10.2MB/s]
60%|###### | 25.0M/41.5M [00:03<00:01, 8.80MB/s]
62%|######2 | 25.9M/41.5M [00:04<00:02, 7.75MB/s]
66%|######5 | 27.3M/41.5M [00:04<00:01, 8.19MB/s]
69%|######9 | 28.8M/41.5M [00:04<00:01, 8.49MB/s]
73%|#######2 | 30.2M/41.5M [00:04<00:01, 9.69MB/s]
76%|#######6 | 31.7M/41.5M [00:04<00:00, 10.9MB/s]
79%|#######9 | 32.8M/41.5M [00:04<00:00, 10.3MB/s]
82%|########1 | 33.8M/41.5M [00:04<00:00, 8.95MB/s]
84%|########3 | 34.8M/41.5M [00:05<00:00, 7.91MB/s]
87%|########7 | 36.1M/41.5M [00:05<00:00, 8.11MB/s]
91%|######### | 37.6M/41.5M [00:05<00:00, 8.43MB/s]
94%|#########4| 39.1M/41.5M [00:05<00:00, 8.66MB/s]
98%|#########7| 40.5M/41.5M [00:05<00:00, 8.80MB/s]
100%|##########| 41.5M/41.5M [00:05<00:00, 7.57MB/s]
diff --git a/docs/_sources/how_to/compile_models/from_paddle.rst.txt b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
index 8e438b380..6da763052 100644
--- a/docs/_sources/how_to/compile_models/from_paddle.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
@@ -210,7 +210,7 @@ Look up prediction top 1 index in 1000 class synset.
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 6.972 seconds)
+ **Total running time of the script:** ( 1 minutes 7.772 seconds)
.. _sphx_glr_download_how_to_compile_models_from_paddle.py:
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index ef1692f7e..36602fe7e 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -79,7 +79,7 @@ Load a pretrained PyTorch model
.. code-block:: none
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
0%| | 0.00/44.7M [00:00<?, ?B/s]
5%|5 | 2.38M/44.7M [00:00<00:01, 24.8MB/s]
11%|# | 4.81M/44.7M [00:00<00:01, 25.0MB/s]
17%|#6 | 7.38M/44.7M [00:00<00:01, 25.4MB/s]
22%|##1 | 9.81M/44.7M [00:00<00:01, 25.3MB/s]
28%|##7 | 12.3M/44.7M [00:00<00:01, 25.6MB/s]
33%|###3 | 14.8M/44.7M [00:00<00:01, 25.7MB/s]
39%|###8 | 17.3M/44.7M [00:00<00:01, 24.8MB/s]
45%|####4 | 20.0M/44.7M [00:00<00:01, 25.7MB/s]
50%|##### | 22.5M/44.7M [00:00<00:00, 25.7MB/s]
56%|#####5 | 25.0M/44.7M [00:01<00:00, 25.9MB/s]
62%|######1 | 27.5M/44.7M [00:01<00:00, 25.7MB/s]
67%|######7 | 29.9M/44.7M [00:01<00:00, 25.2MB/s]
73%|#######2 | 32.4M/44.7M [00:01<00:00, 25.5MB/s]
78%|#######8 | 35.0M/44.7M [00:01<00:00, 26.0MB/s]
84%|########4 | 37.5M/44.7M [00:01<00:00, 25.5MB/s]
90%|########9 | 40.2M/44.7M [00:01<00:00, 26.0MB/s]
96%|#########5| 42.7M/44.7M [00:01<00:00, 26.1MB/s]
100%|##########| 44.7M/44.7M [00:01<00:00, 25.7MB/s]
+
0%| | 0.00/44.7M [00:00<?, ?B/s]
41%|#### | 18.3M/44.7M [00:00<00:00, 192MB/s]
97%|#########6| 43.2M/44.7M [00:00<00:00, 233MB/s]
100%|##########| 44.7M/44.7M [00:00<00:00, 227MB/s]
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index b74d2a754..38ad504a6 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -381,7 +381,7 @@ Run the corresponding model on tensorflow
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 5.091 seconds)
+ **Total running time of the script:** ( 1 minutes 5.517 seconds)
.. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index 4dcc5a04c..6caf0c74c 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,15 +5,15 @@
Computation times
=================
-**05:36.236** total execution time for **how_to_compile_models** files:
+**05:33.413** total execution time for **how_to_compile_models** files:
-- **01:06.972**: :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)
-- **01:05.091**: :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``)
-- **00:59.192**: :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)
-- **00:36.839**: :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)
-- **00:24.334**: :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)
-- **00:23.179**: :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)
-- **00:22.183**: :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)
-- **00:21.079**: :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)
-- **00:14.402**: :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)
-- **00:02.966**: :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)
+- **01:07.772**: :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)
+- **01:05.517**: :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``)
+- **00:58.763**: :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)
+- **00:32.928**: :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)
+- **00:24.649**: :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)
+- **00:24.258**: :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)
+- **00:22.451**: :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)
+- **00:20.237**: :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)
+- **00:14.307**: :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)
+- **00:02.530**: :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index 789ee98a2..290cd326e 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -402,7 +402,7 @@ Execute on TVM
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 16.3155 16.2747 16.7494 16.0559 0.2256
+ 16.5542 16.7218 17.0643 15.8826 0.4657
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index 15748b6fe..bc6ca0e79 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -108,7 +108,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
.. code-block:: none
Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
0%| | 0.00/170M [00:00<?, ?B/s]
10%|# | 17.5M/170M [00:00<00:00, 183MB/s]
25%|##4 | 42.1M/170M [00:00<00:00, 227MB/s]
39%|###9 | 66.5M/170M [00:00<00:00, 240MB/s]
54%|#####3 | 91.0M/170M [00:00<00:00, 246MB/s]
67%|######7 | 115M/170M [00:00<00:00, 242MB/s]
82%|########1 | 139M/170M [00:00<00:00, 246MB/s]
96%|#########6| 163M/170M [00:00<00:00, 250MB/s]
100%|##########| 170M/170M [00:00<00:00, 243MB/s]
+
0%| | 0.00/170M [00:00<?, ?B/s]
2%|1 | 2.59M/170M [00:00<00:06, 26.6MB/s]
4%|3 | 6.02M/170M [00:00<00:05, 32.0MB/s]
16%|#6 | 27.4M/170M [00:00<00:01, 119MB/s]
27%|##7 | 46.4M/170M [00:00<00:00, 151MB/s]
43%|####3 | 73.6M/170M [00:00<00:00, 199MB/s]
59%|#####8 | 99.9M/170M [00:00<00:00, 225MB/s]
75%|#######4 | 127M/170M [00:00<00:00, 243MB/s]
88%|########8 | 150M/170M [00:00<00:00, 230MB/s]
100%|##########| 170M/170M [00:00<00:00, 196MB/s]
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
for i in range(dim)
/usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -262,7 +262,7 @@ Get boxes with score larger than 0.9
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 2 minutes 58.856 seconds)
+ **Total running time of the script:** ( 2 minutes 58.144 seconds)
.. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index c2dd51647..90a73d0d1 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -187,7 +187,7 @@ training. Other models require a full post training calibration.
.. code-block:: none
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
0%| | 0.00/13.6M [00:00<?, ?B/s]
41%|#### | 5.55M/13.6M [00:00<00:00, 57.7MB/s]
82%|########1 | 11.1M/13.6M [00:00<00:00, 56.1MB/s]
100%|##########| 13.6M/13.6M [00:00<00:00, 65.6MB/s]
+
0%| | 0.00/13.6M [00:00<?, ?B/s]
28%|##8 | 3.81M/13.6M [00:00<00:00, 39.7MB/s]
56%|#####6 | 7.60M/13.6M [00:00<00:00, 36.1MB/s]
86%|########6 | 11.7M/13.6M [00:00<00:00, 38.8MB/s]
100%|##########| 13.6M/13.6M [00:00<00:00, 37.0MB/s]
@@ -353,7 +353,7 @@ Here we give an example of how to measure performance of TVM compiled models.
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 90.3550 90.2908 93.6545 90.0487 0.3945
+ 90.3869 90.3271 93.3037 90.2228 0.3143
@@ -393,7 +393,7 @@ TODO
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 8.392 seconds)
+ **Total running time of the script:** ( 1 minutes 8.345 seconds)
.. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index 7dd06bdca..700f60d4a 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -360,7 +360,7 @@ Here we give an example of how to measure performance of TVM compiled models.
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 119.9743 119.8743 123.1352 119.1519 0.5602
+ 121.2660 121.1909 123.7286 120.5713 0.4329
@@ -394,7 +394,7 @@ Here we give an example of how to measure performance of TVM compiled models.
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 59.433 seconds)
+ **Total running time of the script:** ( 1 minutes 55.889 seconds)
.. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index c29d57704..ea67aa1a4 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -223,7 +223,7 @@ We create a Relay VM to build and execute the model.
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 29.627 seconds)
+ **Total running time of the script:** ( 1 minutes 12.332 seconds)
.. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index da3a02e67..bb9a0f614 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -137,7 +137,7 @@ Convert and compile model for CPU.
data: None
input_sym_arg_type = in_param.infer_type()[0]
Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
0%| | 0/132723 [00:00<?, ?KB/s]
3%|3 | 4435/132723 [00:00<00:02, 43565.93KB/s]
10%|9 | 12736/132723 [00:00<00:01, 66595.18KB/s]
16%|#6 | 21534/132723 [00:00<00:01, 76318.71KB/s]
23%|##2 | 30396/132723 [00:00<00:01, 81161.08KB/s]
30%|##9 | 39264/132723 [00:00<00:01, 83864.43KB/s]
36%|###6 | 48180/132723 [00:00<00:00, 85660.24KB/s]
43%|####2 | 57015/132723 [00:00<00:00, 86537.00KB/s]
50%|####9 | 65862/132723 [00:00<00:00, 87150.29KB/s]
56%|#####6 | 74745/132723 [00:00<00:00, 87673.92KB/s]
63%|######2 | 83610/132723 [00:01<00:00, 87971.77KB/s]
70%|######9 | 92546/132723 [00:01<00:00, 88392.06KB/s]
76%|#######6 | 101465/132723 [00:01<00:00, 88632.85KB/s]
83%|########3 | 110353/132723 [00:01<00:00, 88706.07KB/s]
90%|########9 | 119224/132723 [00:01<00:00, 88578.13KB/s]
97%|#########6| 128083/132723 [00:01<00:00, 88200.81KB/s]
100%|##########| 132723/132723 [00:01<00:00, 85184.13KB/s]
+
0%| | 0/132723 [00:00<?, ?KB/s]
5%|5 | 7131/132723 [00:00<00:01, 71302.59KB/s]
12%|#1 | 15498/132723 [00:00<00:01, 78571.54KB/s]
18%|#8 | 23936/132723 [00:00<00:01, 81217.31KB/s]
24%|##4 | 32465/132723 [00:00<00:01, 82821.55KB/s]
31%|### | 40991/132723 [00:00<00:01, 83698.67KB/s]
37%|###7 | 49571/132723 [00:00<00:00, 84409.61KB/s]
44%|####3 | 58058/132723 [00:00<00:00, 84557.24KB/s]
50%|##### | 66578/132723 [00:00<00:00, 84760.14KB/s]
57%|#####6 | 75055/132723 [00:00<00:00, 82763.80KB/s]
63%|######2 | 83602/132723 [00:01<00:00, 83582.62KB/s]
69%|######9 | 92214/132723 [00:01<00:00, 84348.45KB/s]
76%|#######5 | 100781/132723 [00:01<00:00, 84743.24KB/s]
82%|########2 | 109329/132723 [00:01<00:00, 84961.90KB/s]
89%|########8 | 117995/132723 [00:01<00:00, 85469.97KB/s]
95%|#########5| 126597/132723 [00:01<00:00, 85632.38KB/s]
100%|##########| 132723/132723 [00:01<00:00, 83967.03KB/s]
@@ -211,7 +211,7 @@ Display result
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 2 minutes 22.655 seconds)
+ **Total running time of the script:** ( 2 minutes 21.806 seconds)
.. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index 6ab41d2e3..a52f849a7 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,13 +5,13 @@
Computation times
=================
-**10:51.763** total execution time for **how_to_deploy_models** files:
+**10:28.339** total execution time for **how_to_deploy_models** files:
-- **02:58.856**: :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
-- **02:22.655**: :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
-- **01:59.433**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
-- **01:29.627**: :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)
-- **01:08.392**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)
-- **00:30.420**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)
-- **00:22.175**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
-- **00:00.205**: :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)
+- **02:58.144**: :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
+- **02:21.806**: :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
+- **01:55.889**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
+- **01:12.332**: :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)
+- **01:08.345**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)
+- **00:29.408**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)
+- **00:22.206**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
+- **00:00.208**: :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index e29edeaf7..bc4e2dfb7 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -425,7 +425,7 @@ First let us define two helper functions to get the mobilenet model and a cat im
.. code-block:: none
- Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip1beec7b5-11b5-4b0d-a47a-6a5a0f40d2cc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+ Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip5cf78752-2709-4777-b769-e323af70ff83 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 5425d256f..500dd85be 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,9 +5,9 @@
Computation times
=================
-**00:41.304** total execution time for **how_to_extend_tvm** files:
+**00:41.867** total execution time for **how_to_extend_tvm** files:
-- **00:37.436**: :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
-- **00:02.498**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)
-- **00:01.155**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)
+- **00:38.032**: :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
+- **00:02.477**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)
+- **00:01.144**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)
- **00:00.215**: :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index 9a729aed8..139afa229 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -199,10 +199,10 @@ profile the execution time of each passes.
.. code-block:: none
Printing results of timing profile...
- InferType: 7436us [7436us] (47.14%; 47.14%)
- FoldScaleAxis: 8338us [6us] (52.86%; 52.86%)
- FoldConstant: 8332us [1672us] (52.82%; 99.92%)
- InferType: 6660us [6660us] (42.22%; 79.93%)
+ InferType: 6663us [6663us] (45.53%; 45.53%)
+ FoldScaleAxis: 7972us [6us] (54.47%; 54.47%)
+ FoldConstant: 7967us [1604us] (54.44%; 99.93%)
+ InferType: 6363us [6363us] (43.48%; 79.87%)
@@ -239,10 +239,10 @@ Refer to following sections and :py:func:`tvm.instrument.pass_instrument` for th
.. code-block:: none
Printing results of timing profile...
- InferType: 6691us [6691us] (44.81%; 44.81%)
- FoldScaleAxis: 8242us [6us] (55.19%; 55.19%)
- FoldConstant: 8236us [1687us] (55.15%; 99.93%)
- InferType: 6549us [6549us] (43.86%; 79.52%)
+ InferType: 7142us [7142us] (47.08%; 47.08%)
+ FoldScaleAxis: 8029us [6us] (52.92%; 52.92%)
+ FoldConstant: 8024us [1643us] (52.88%; 99.93%)
+ InferType: 6380us [6380us] (42.05%; 79.52%)
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index 2f3b20de9..0700ec093 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -295,7 +295,7 @@ latency of convolution.
.. code-block:: none
- Convolution: 54.197352 ms
+ Convolution: 54.211124 ms
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index a1f34e47f..80bb16768 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -628,7 +628,7 @@ be able to run on our build server
.. code-block:: none
- conv2d with tensor core: 6.542161 ms
+ conv2d with tensor core: 6.877208 ms
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index c03df39b9..7e688845e 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -118,8 +118,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
.. code-block:: none
- Numpy running time: 0.019223
- Baseline: 3.278164
+ Numpy running time: 0.019617
+ Baseline: 3.480961
@@ -210,7 +210,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
.. code-block:: none
- Opt1: 0.318309
+ Opt1: 0.311245
@@ -309,7 +309,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
.. code-block:: none
- Opt2: 0.343364
+ Opt2: 0.342973
@@ -401,7 +401,7 @@ the access pattern for A matrix is more cache friendly.
.. code-block:: none
- Opt3: 0.125443
+ Opt3: 0.121032
@@ -520,7 +520,7 @@ flattening.
.. code-block:: none
- Opt4: 0.111377
+ Opt4: 0.111244
@@ -638,7 +638,7 @@ write to C when all the block results are ready.
.. code-block:: none
- Opt5: 0.112086
+ Opt5: 0.112666
@@ -759,7 +759,7 @@ Futhermore, we can also utilize multi-core processors to do the thread-level par
.. code-block:: none
- Opt6: 0.146824
+ Opt6: 0.146817
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index cab6a286f..632125614 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
Computation times
=================
-**00:35.260** total execution time for **how_to_optimize_operators** files:
+**00:35.905** total execution time for **how_to_optimize_operators** files:
-- **00:32.592**: :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)
-- **00:01.437**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
-- **00:01.231**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)
+- **00:33.069**: :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)
+- **00:01.542**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
+- **00:01.295**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index 20fe9fe9d..d70982566 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,11 +5,11 @@
Computation times
=================
-**05:23.442** total execution time for **how_to_tune_with_autoscheduler** files:
-
-- **02:32.857**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
-- **01:21.762**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)
-- **00:43.716**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)
-- **00:27.262**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)
-- **00:09.003**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)
-- **00:08.842**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)
+**05:21.893** total execution time for **how_to_tune_with_autoscheduler** files:
+
+- **02:42.124**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
+- **01:21.321**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)
+- **00:43.525**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)
+- **00:17.397**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)
+- **00:08.861**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)
+- **00:08.665**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index 5e03a9224..185a4b42b 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -222,483 +222,669 @@ cooperative fetching, unrolling and operator fusion.
compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
- attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 28;
- allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
- allocate(pad_temp.shared: Pointer(shared float32), float32, [72]), storage_scope = shared;
- allocate(kernel.shared: Pointer(shared float32), float32, [3072]), storage_scope = shared;
- attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
- conv2d_nchw_1: Buffer(conv2d_nchw, float32, [14], [], scope="local", align=32)[0] = 0f32
+ attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 32;
+ allocate(conv2d_nchw: Pointer(local float32), float32, [2]), storage_scope = local;
+ allocate(pad_temp.shared: Pointer(shared float32), float32, [2016]), storage_scope = shared;
+ allocate(kernel.shared: Pointer(shared float32), float32, [1536]), storage_scope = shared;
+ attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392 {
+ conv2d_nchw_1: Buffer(conv2d_nchw, float32, [2], [], scope="local", align=8)[0] = 0f32
conv2d_nchw_1[1] = 0f32
- conv2d_nchw_1[2] = 0f32
- conv2d_nchw_1[3] = 0f32
- conv2d_nchw_1[4] = 0f32
- conv2d_nchw_1[5] = 0f32
- conv2d_nchw_1[6] = 0f32
- conv2d_nchw_1[7] = 0f32
- conv2d_nchw_1[8] = 0f32
- conv2d_nchw_1[9] = 0f32
- conv2d_nchw_1[10] = 0f32
- conv2d_nchw_1[11] = 0f32
- conv2d_nchw_1[12] = 0f32
- conv2d_nchw_1[13] = 0f32
- for (rc.outer.outer: int32, 0, 64) {
- for (ry.outer.outer: int32, 0, 3) {
- let cse_var_2: int32 = (rc.outer.outer*72)
- let cse_var_1: int32 = (ry.outer.outer*3)
- {
- attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1: Buffer(pad_temp.shared, float32, [72], [], scope="shared")[(threadIdx.x_1*4)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1*4), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f3 [...]
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 1), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 2), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 3), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32)
- }
- }
- attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1: Buffer(kernel.shared, float32, [3072], [], scope="shared")[threadIdx.x_2] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 64)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 8), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 128)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 16), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 192)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 36864)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 256)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 32), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 64), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 320)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 40), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 80), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 384)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 73728)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 56), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 112), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 512)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 64), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 128), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 576)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 110592)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 640)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 80), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 160), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 704)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 88), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 176), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 768)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 147456)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 832)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 104), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 208), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 896)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 112), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 224), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 960)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 184320)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1024)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 128), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 256), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1088)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 136), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 272), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1152)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 221184)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1216)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 152), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 304), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1280)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 160), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 320), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1344)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 258048)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1408)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 176), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 352), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1472)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 184), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 368), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1536)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 294912)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1600)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 200), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 400), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1664)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 208), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 416), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1728)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 331776)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1792)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 224), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 448), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1856)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 232), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 464), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1920)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 368640)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1984)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 248), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 496), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2048)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 256), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 512), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2112)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 405504)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2176)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 272), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 544), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2240)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 280), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 560), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2304)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 442368)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2368)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 296), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 592), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2432)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 304), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 608), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2496)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 479232)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2560)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 320), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 640), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2624)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 328), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 656), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2688)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 516096)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2752)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 344), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 688), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2816)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 352), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 704), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2880)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 552960)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2944)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 368), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 736), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 3008)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 376), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 752), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[0]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[9]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[1]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[2]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[3]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[4]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[5]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[6]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[0]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[9]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[8]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[17]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[8]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[17]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[18]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[27]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[18]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[27]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[26]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[35]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[26]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[35]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[36]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[45]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[36]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[45]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[44]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[53]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[44]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[53]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[54]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[63]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[54]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[63]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[62]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[71]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[62]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[71]*kernel.shared_1[((threadIdx.x*48) + 47)]))
+ for (rc.outer.outer: int32, 0, 16) {
+ let cse_var_2: int32 = (rc.outer.outer*1568)
+ let cse_var_1: int32 = (rc.outer.outer*288)
+ {
+ attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1: Buffer(pad_temp.shared, float32, [2016], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)) && (1 <= floormod(threadIdx.x_1, 7))), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
}
+ attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1: Buffer(kernel.shared, float32, [1536], [], scope="shared")[threadIdx.x_2] = kernel[((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3))]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[threadIdx.x_1] = @tir.if_then_else(((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ }
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3)) + 1)]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[threadIdx.x_1] = @tir.if_then_else((((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)) && (floormod(threadIdx.x_1, 7) < 6)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ }
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3)) + 2)]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
}
}
for (i1.inner: int32, 0, 2) {
- for (i3.inner: int32, 0, 7) {
- compute[(((((floordiv(blockIdx.x, 7)*6272) + (threadIdx.x*98)) + (i1.inner*49)) + (floormod(blockIdx.x, 7)*7)) + i3.inner)] = max((conv2d_nchw_1[((i1.inner*7) + i3.inner)] + bias[(((floordiv(blockIdx.x, 7)*128) + (threadIdx.x*2)) + i1.inner)]), 0f32)
- }
+ compute[((((blockIdx.x*784) + (floordiv(threadIdx.x, 49)*98)) + (i1.inner*49)) + floormod(threadIdx.x, 49))] = max((conv2d_nchw_1[i1.inner] + bias[(((blockIdx.x*16) + (floordiv(threadIdx.x, 49)*2)) + i1.inner)]), 0f32)
}
}
}
@@ -751,7 +937,7 @@ We build the binary and check its correctness and performance.
.. code-block:: none
- Execution time of this operator: 0.362 ms
+ Execution time of this operator: 0.274 ms
@@ -797,34 +983,34 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
- conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=64)
+ conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=8)
conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
- conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
+ conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
- conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=7)
- conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
+ conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
+ conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
- conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
- conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=4)
+ conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
+ conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=32)
conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
- conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+ conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
- conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+ conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
- compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=64)
+ compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
- compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
+ compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
- compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
- compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
+ compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
+ compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
@@ -844,14 +1030,14 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
- kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
+ kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
- pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
+ pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
- pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
+ pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
- s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512)
+ s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
CUDA source code:
@@ -869,430 +1055,640 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
#define int64_t long long
#define uint64_t unsigned long long
#endif
- extern "C" __global__ void __launch_bounds__(64) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
- float conv2d_nchw[14];
- __shared__ float pad_temp_shared[72];
- __shared__ float kernel_shared[3072];
+ extern "C" __global__ void __launch_bounds__(392) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+ float conv2d_nchw[2];
+ __shared__ float pad_temp_shared[2016];
+ __shared__ float kernel_shared[1536];
conv2d_nchw[0] = 0.000000e+00f;
conv2d_nchw[1] = 0.000000e+00f;
- conv2d_nchw[2] = 0.000000e+00f;
- conv2d_nchw[3] = 0.000000e+00f;
- conv2d_nchw[4] = 0.000000e+00f;
- conv2d_nchw[5] = 0.000000e+00f;
- conv2d_nchw[6] = 0.000000e+00f;
- conv2d_nchw[7] = 0.000000e+00f;
- conv2d_nchw[8] = 0.000000e+00f;
- conv2d_nchw[9] = 0.000000e+00f;
- conv2d_nchw[10] = 0.000000e+00f;
- conv2d_nchw[11] = 0.000000e+00f;
- conv2d_nchw[12] = 0.000000e+00f;
- conv2d_nchw[13] = 0.000000e+00f;
- for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) {
- for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
- __syncthreads();
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[(((int)threadIdx.x) * 4)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= ((((int)threadIdx.x) * 4) % 9))) && (((((int)threadIdx.x) * 4) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + (((((int)threadIdx.x) * 4) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + ((((int)threadIdx.x) * 4) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 1)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 1) % 9))) && ((((((int)threadIdx.x) * 4) + 1) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 1) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 1) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 2)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 2) % 9))) && ((((((int)threadIdx.x) * 4) + 2) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 2) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 2) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 3)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 3) % 9))) && ((((((int)threadIdx.x) * 4) + 3) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 3) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 3) % 9)) - 8)] : 0.000000e+00f);
- }
- kernel_shared[((int)threadIdx.x)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
- kernel_shared[(((int)threadIdx.x) + 64)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 64) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 128)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 128) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 192)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36864)];
- kernel_shared[(((int)threadIdx.x) + 256)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 256) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 320)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 320) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 384)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 73728)];
- kernel_shared[(((int)threadIdx.x) + 448)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 512)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 512) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 576)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 110592)];
- kernel_shared[(((int)threadIdx.x) + 640)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 640) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 704)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 704) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 768)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 147456)];
- kernel_shared[(((int)threadIdx.x) + 832)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 832) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 896)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 896) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 960)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 184320)];
- kernel_shared[(((int)threadIdx.x) + 1024)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1024) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1088)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1088) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1152)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 221184)];
- kernel_shared[(((int)threadIdx.x) + 1216)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1216) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1280)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1280) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1344)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 258048)];
- kernel_shared[(((int)threadIdx.x) + 1408)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1408) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1472)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1472) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1536)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 294912)];
- kernel_shared[(((int)threadIdx.x) + 1600)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1600) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1664)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1664) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1728)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 331776)];
- kernel_shared[(((int)threadIdx.x) + 1792)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1792) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1856)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1856) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1920)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 368640)];
- kernel_shared[(((int)threadIdx.x) + 1984)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1984) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2048)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2048) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2112)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 405504)];
- kernel_shared[(((int)threadIdx.x) + 2176)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2176) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2240)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2240) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2304)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 442368)];
- kernel_shared[(((int)threadIdx.x) + 2368)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2368) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2432)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2432) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2496)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 479232)];
- kernel_shared[(((int)threadIdx.x) + 2560)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2624)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2624) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2688)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 516096)];
- kernel_shared[(((int)threadIdx.x) + 2752)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2752) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2816)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2816) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2880)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 552960)];
- kernel_shared[(((int)threadIdx.x) + 2944)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2944) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 3008)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 3008) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- __syncthreads();
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[0] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[9] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[1] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[2] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[3] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[4] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[5] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[6] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[0] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[9] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[8] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[17] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[8] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[17] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[18] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[27] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[18] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[27] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[26] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[35] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[26] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[35] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[36] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[45] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[36] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[45] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[44] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[53] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[44] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[53] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[54] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[63] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[54] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[63] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[62] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[71] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[62] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[71] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
+ for (int rc_outer_outer = 0; rc_outer_outer < 16; ++rc_outer_outer) {
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = ((((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) && (1 <= (((int)threadIdx.x) % 7))) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = ((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = ((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = ((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = ((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((((int)threadIdx.x) < 49) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ }
+ kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3))];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3))];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3))];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3))];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = (((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = (((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = (((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = (((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = ((((int)threadIdx.x) < 49) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
}
+ kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3)) + 1)];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3)) + 1)];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + 1)];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + 1)];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = ((((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) && ((((int)threadIdx.x) % 7) < 6)) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = ((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = ((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = ((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = ((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((((int)threadIdx.x) < 49) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ }
+ kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3)) + 2)];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3)) + 2)];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + 2)];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + 2)];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
}
for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
- for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {
- compute[((((((((int)blockIdx.x) / 7) * 6272) + (((int)threadIdx.x) * 98)) + (i1_inner * 49)) + ((((int)blockIdx.x) % 7) * 7)) + i3_inner)] = max((conv2d_nchw[((i1_inner * 7) + i3_inner)] + bias[((((((int)blockIdx.x) / 7) * 128) + (((int)threadIdx.x) * 2)) + i1_inner)]), 0.000000e+00f);
- }
+ compute[((((((int)blockIdx.x) * 784) + ((((int)threadIdx.x) / 49) * 98)) + (i1_inner * 49)) + (((int)threadIdx.x) % 49))] = max((conv2d_nchw[i1_inner] + bias[(((((int)blockIdx.x) * 16) + ((((int)threadIdx.x) / 49) * 2)) + i1_inner)]), 0.000000e+00f);
}
}
@@ -1351,7 +1747,7 @@ In the example below we resume the status and do more 5 trials.
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 2 minutes 32.857 seconds)
+ **Total running time of the script:** ( 2 minutes 42.124 seconds)
.. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index 4f36ff57b..f42f98357 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -616,7 +616,7 @@ so we can read the log file and load the best schedules.
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 9.9299 9.9418 9.9527 9.8950 0.0250
+ 9.5497 9.5434 9.5722 9.5336 0.0164
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index 2598a8241..fcd160862 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -635,7 +635,7 @@ so we can read the log file and load the best schedules.
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 759.8272 759.6222 760.7100 759.1494 0.6534
+ 760.3963 760.2333 761.3582 759.5974 0.7280
@@ -660,7 +660,7 @@ Other Tips
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 1 minutes 21.762 seconds)
+ **Total running time of the script:** ( 1 minutes 21.321 seconds)
.. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 4cf020ee8..d22cd46e2 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -362,30 +362,76 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
- preflattened_buffer_map = {placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_9: placeholder_16: Buffer(placeholder_14, float32, [128, 512], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_6: placeholder_17: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_5: placeholder_19: Buffer(placeholder_10, float32, [128, 256], [])} {
- for (i0.outer.i1.outer.fused: int32, 0, 128) "parallel" {
- allocate(compute_4: Pointer(global float32), float32, [512]), storage_scope = global {
- for (i.outer.inner: int32, 0, 2) {
- for (nb_j.inner: int32, 0, 2) {
- for (i.inner.init: int32, 0, 8) {
- for (j.init: int32, 0, 16) {
- compute_5: Buffer(compute_4, float32, [512], [])[((((i.outer.inner*256) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
- }
+ preflattened_buffer_map = {placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_17: Buffer(placeholder_14, float32, [128, 512], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_6: placeholder_19: Buffer(placeholder_11, float32, [4916, 16, 1], [])} {
+ for (i0.outer.i1.outer.fused: int32, 0, 64) "parallel" {
+ allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
+ for (nb_j.inner: int32, 0, 2) {
+ for (i.inner.init: int32, 0, 32) {
+ let cse_var_1: int32 = ((i.inner.init*32) + (nb_j.inner*16))
+ {
+ compute_5: Buffer(compute_4, float32, [1024], [])[cse_var_1] = 0f32
+ compute_5[(cse_var_1 + 1)] = 0f32
+ compute_5[(cse_var_1 + 2)] = 0f32
+ compute_5[(cse_var_1 + 3)] = 0f32
+ compute_5[(cse_var_1 + 4)] = 0f32
+ compute_5[(cse_var_1 + 5)] = 0f32
+ compute_5[(cse_var_1 + 6)] = 0f32
+ compute_5[(cse_var_1 + 7)] = 0f32
+ compute_5[(cse_var_1 + 8)] = 0f32
+ compute_5[(cse_var_1 + 9)] = 0f32
+ compute_5[(cse_var_1 + 10)] = 0f32
+ compute_5[(cse_var_1 + 11)] = 0f32
+ compute_5[(cse_var_1 + 12)] = 0f32
+ compute_5[(cse_var_1 + 13)] = 0f32
+ compute_5[(cse_var_1 + 14)] = 0f32
+ compute_5[(cse_var_1 + 15)] = 0f32
}
- for (elem_idx: int32, 0, let cse_var_1: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
- for (i.inner: int32, 0, 8) {
- for (j: int32, 0, 16) {
- let cse_var_3: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
- let cse_var_2: int32 = ((((i.outer.inner*256) + (i.inner*32)) + (nb_j.inner*16)) + j)
- compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i.outer.inner*2048)) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
- }
+ }
+ for (elem_idx: int32, 0, let cse_var_2: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
+ for (i.inner: int32, 0, 32) {
+ let cse_var_21: int32 = (elem_idx*16)
+ let cse_var_20: int32 = ((i.inner*32) + (nb_j.inner*16))
+ let cse_var_19: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
+ let cse_var_18: int32 = ((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i.inner*256))
+ let cse_var_17: int32 = (cse_var_20 + 9)
+ let cse_var_16: int32 = (cse_var_20 + 8)
+ let cse_var_15: int32 = (cse_var_20 + 7)
+ let cse_var_14: int32 = (cse_var_20 + 6)
+ let cse_var_13: int32 = (cse_var_20 + 5)
+ let cse_var_12: int32 = (cse_var_20 + 4)
+ let cse_var_11: int32 = (cse_var_20 + 3)
+ let cse_var_10: int32 = (cse_var_20 + 2)
+ let cse_var_9: int32 = (cse_var_20 + 15)
+ let cse_var_8: int32 = (cse_var_20 + 14)
+ let cse_var_7: int32 = (cse_var_20 + 13)
+ let cse_var_6: int32 = (cse_var_20 + 12)
+ let cse_var_5: int32 = (cse_var_20 + 11)
+ let cse_var_4: int32 = (cse_var_20 + 10)
+ let cse_var_3: int32 = (cse_var_20 + 1)
+ {
+ compute_5[cse_var_20] = (compute_5[cse_var_20] + (placeholder_1[((placeholder_3[cse_var_19]*16) + cse_var_21)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
}
}
}
}
- for (i0.inner: int32, 0, 16) {
- let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
- compute[ramp(cse_var_4, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
+ for (i0.inner: int32, 0, 32) {
+ let cse_var_22: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*16384) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
+ compute[ramp(cse_var_22, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_22, 1, 32)]), broadcast(0f32, 32))
}
}
}
@@ -439,7 +485,7 @@ We build the binary and check its correctness and performance.
.. code-block:: none
- Execution time of this operator: 1.571 ms
+ Execution time of this operator: 1.743 ms
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index 383b6d3bd..a502fef22 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
Computation times
=================
-**00:44.102** total execution time for **how_to_tune_with_autotvm** files:
+**00:45.393** total execution time for **how_to_tune_with_autotvm** files:
-- **00:43.179**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
-- **00:00.244**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
-- **00:00.227**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
-- **00:00.227**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+- **00:44.476**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
+- **00:00.238**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
+- **00:00.229**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
- **00:00.226**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
+- **00:00.225**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index 898a9d35a..331599f68 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -859,8 +859,8 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 4, 4, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2885496
- No: 6 GFLOPS: 94.82/94.82 result: MeasureResult(costs=(0.0024414937291666666,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6543385982513428, timestamp=1654935570.08604) [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
- No: 7 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 6 GFLOPS: 110.87/110.87 result: MeasureResult(costs=(0.002088087645833333,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8702428340911865, timestamp=1654980075.9523664) [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
+ No: 7 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -983,7 +983,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 16, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6225319
- No: 8 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 8 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1106,7 +1106,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 8, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,943546
- No: 9 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 9 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1229,7 +1229,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 4, 16, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2868708
- No: 10 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 10 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
res = future.result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1247,7 +1247,7 @@ for this template
TimeoutError
[('tile_f', [-1, 32, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4691833
- No: 11 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 11 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1370,7 +1370,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1042124
- No: 12 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 12 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1493,7 +1493,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 32, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10013405
- No: 13 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 13 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1616,7 +1616,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 8, 8, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6732082
- No: 14 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 14 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1739,7 +1739,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 4, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7536735
- No: 15 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 15 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1862,7 +1862,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,482121
- No: 16 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 16 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1985,7 +1985,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 16]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2824525
- No: 17 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 17 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2108,7 +2108,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4559286
- No: 18 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 18 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2231,7 +2231,7 @@ for this template
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 32, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9677544
- No: 19 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+ No: 19 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 721, in __call__
yield remote, remote.load_module(os.path.split(build_result.filename)[1])
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 685, in run_through_rpc
@@ -2319,7 +2319,7 @@ for this template
15: _PyEval_EvalFrameDefault
14: 0x0000000000537c30
13: _PyObject_FastCallKeywords
- 12: 0x00007f49fea61fa2
+ 12: 0x00007fc67a97dfa2
11: _ctypes_callproc
10: ffi_call
9: ffi_call_unix64
@@ -2384,7 +2384,7 @@ for this template
21: _PyFunction_FastCallKeywords
20: _PyEval_EvalFrameDefault
19: _PyFunction_FastCall [('tile_f', [-1, 8, 2, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6390073
- No: 20 GFLOPS: 144.91/144.91 result: MeasureResult(costs=(0.0015975758700000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4393105506896973, timestamp=1654935595.9084878) [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
+ No: 20 GFLOPS: 144.26/144.26 result: MeasureResult(costs=(0.0016047824999999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4284491539001465, timestamp=1654980101.9109926) [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
@@ -2437,7 +2437,7 @@ and measure running time.
Best config:
[('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
- Time cost of this operator: 0.002003
+ Time cost of this operator: 0.001986
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index 544803cc4..16eb35800 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -294,10 +294,10 @@ Timing the untuned program
########## Build without Autotuning ##########
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
- tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 313.0 98.733 (1, 2, 10, 10, 3) 2 1
- tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 3.094 0.976 (1, 6, 10, 10) 1 1
- tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.923 0.291 (1, 1, 10, 10, 3) 1 1
- Total_time - 317.017 - - - -
+ tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 313.0 98.746 (1, 2, 10, 10, 3) 2 1
+ tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 3.073 0.969 (1, 6, 10, 10) 1 1
+ tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.904 0.285 (1, 1, 10, 10, 3) 1 1
+ Total_time - 316.976 - - - -
@@ -359,10 +359,10 @@ Timing the tuned program
########## Build with Autotuning ##########
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
- tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 208.1 98.757 (1, 6, 10, 10, 1) 2 1
- tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 1.753 0.832 (1, 6, 10, 10) 1 1
- tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.866 0.411 (1, 3, 10, 10, 1) 1 1
- Total_time - 210.719 - - - -
+ tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 192.7 98.39 (1, 1, 10, 10, 6) 2 1
+ tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 2.147 1.096 (1, 6, 10, 10) 1 1
+ tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 1.007 0.514 (1, 3, 10, 10, 1) 1 1
+ Total_time - 195.854 - - - -
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
index b1b0f3874..1cc7d81f3 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
@@ -297,8 +297,8 @@ objects to other stuff? We can display some examples from our datasets using ``m
.. code-block:: none
- /tmp/tmpgctt5uyy/images/target contains 8144 images
- /tmp/tmpgctt5uyy/images/random contains 5000 images
+ /tmp/tmpaipbfjd6/images/target contains 8144 images
+ /tmp/tmpaipbfjd6/images/random contains 5000 images
@@ -459,11 +459,11 @@ the time on our validation set).
.. code-block:: none
Epoch 1/3
- 328/328 - 54s - loss: 0.2112 - accuracy: 0.9264 - val_loss: 0.1324 - val_accuracy: 0.9569
+ 328/328 - 55s - loss: 0.2101 - accuracy: 0.9294 - val_loss: 0.1816 - val_accuracy: 0.9430
Epoch 2/3
- 328/328 - 52s - loss: 0.1006 - accuracy: 0.9627 - val_loss: 0.1318 - val_accuracy: 0.9622
+ 328/328 - 52s - loss: 0.0969 - accuracy: 0.9635 - val_loss: 0.1619 - val_accuracy: 0.9543
Epoch 3/3
- 328/328 - 52s - loss: 0.0694 - accuracy: 0.9735 - val_loss: 0.1206 - val_accuracy: 0.9596
+ 328/328 - 52s - loss: 0.0644 - accuracy: 0.9760 - val_loss: 0.1358 - val_accuracy: 0.9588
@@ -825,7 +825,7 @@ Arduino tutorial for how to do that `on GitHub <https://github.com/guberti/tvm-a
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 5 minutes 34.618 seconds)
+ **Total running time of the script:** ( 4 minutes 28.596 seconds)
.. _sphx_glr_download_how_to_work_with_microtvm_micro_train.py:
diff --git a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
index 469b5b113..4234ea94b 100644
--- a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
@@ -5,11 +5,11 @@
Computation times
=================
-**06:21.428** total execution time for **how_to_work_with_microtvm** files:
-
-- **05:34.618**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)
-- **00:42.438**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)
-- **00:03.751**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)
-- **00:00.209**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tvmc.py` (``micro_tvmc.py``)
-- **00:00.207**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)
-- **00:00.206**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_reference_vm.py` (``micro_reference_vm.py``)
+**05:16.413** total execution time for **how_to_work_with_microtvm** files:
+
+- **04:28.596**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)
+- **00:43.445**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)
+- **00:03.755**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)
+- **00:00.208**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tvmc.py` (``micro_tvmc.py``)
+- **00:00.205**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)
+- **00:00.203**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_reference_vm.py` (``micro_reference_vm.py``)
diff --git a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
index 497da2cb3..85e058ad2 100644
--- a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
Computation times
=================
-**00:06.380** total execution time for **how_to_work_with_relay** files:
+**00:12.267** total execution time for **how_to_work_with_relay** files:
-- **00:04.582**: :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)
-- **00:01.572**: :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)
-- **00:00.226**: :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)
+- **00:10.199**: :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)
+- **00:01.840**: :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)
+- **00:00.228**: :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)
diff --git a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
index aeb2f3167..31414cfda 100644
--- a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
@@ -5,13 +5,13 @@
Computation times
=================
-**00:05.316** total execution time for **how_to_work_with_schedules** files:
+**00:06.103** total execution time for **how_to_work_with_schedules** files:
-- **00:01.996**: :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)
-- **00:00.854**: :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)
-- **00:00.720**: :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)
-- **00:00.698**: :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)
-- **00:00.320**: :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)
-- **00:00.254**: :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``)
+- **00:02.245**: :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)
+- **00:01.263**: :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)
+- **00:00.780**: :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)
+- **00:00.761**: :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)
+- **00:00.318**: :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)
+- **00:00.262**: :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``)
- **00:00.244**: :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)
- **00:00.231**: :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)
diff --git a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
index a99636b1a..ba9e69ab2 100644
--- a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
@@ -318,7 +318,7 @@ The importing needs to happen before the tensorized GEMV being executed.
C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
- attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmp14y5jrep/input0.cc'\nsource_filename = \"/tmp/tmp14y5jrep/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n %7 = alloca float*, align 8\n %8 = alloca float*, align 8\n %9 = alloca floa [...]
+ attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpin6d2mbk/input0.cc'\nsource_filename = \"/tmp/tmpin6d2mbk/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n %7 = alloca float*, align 8\n %8 = alloca float*, align 8\n %9 = alloca floa [...]
for (i, 0, 1024) {
for (j.outer: int32, 0, 32) {
@tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
index bf5f93d7a..60a1ea710 100644
--- a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
Computation times
=================
-**00:21.721** total execution time for **topic_vta_tutorials_autotvm** files:
+**00:21.411** total execution time for **topic_vta_tutorials_autotvm** files:
-- **00:21.504**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
-- **00:00.218**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)
+- **00:21.192**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
+- **00:00.219**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
index e803e0944..42035363e 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
@@ -267,7 +267,7 @@ The compilation steps are:
DeprecationWarning,
/workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the new recommended usage.
relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
- resnet18_v1 inference graph built in 23.05s!
+ resnet18_v1 inference graph built in 22.98s!
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
index 1d4db1ab1..50c960243 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
@@ -303,7 +303,7 @@ The compilation steps are:
"target_host parameter is going to be deprecated. "
/workspace/python/tvm/relay/build_module.py:389: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
DeprecationWarning,
- yolov3-tiny inference graph built in 16.09s!
+ yolov3-tiny inference graph built in 15.90s!
diff --git a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
index df0a40027..b6f361175 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
Computation times
=================
-**01:31.468** total execution time for **topic_vta_tutorials_frontend** files:
+**01:32.349** total execution time for **topic_vta_tutorials_frontend** files:
-- **00:48.207**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)
-- **00:43.261**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
+- **00:48.754**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)
+- **00:43.595**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
diff --git a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
index 57ba83993..ae960996f 100644
--- a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
Computation times
=================
-**00:03.586** total execution time for **topic_vta_tutorials_optimize** files:
+**00:03.767** total execution time for **topic_vta_tutorials_optimize** files:
-- **00:03.016**: :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
-- **00:00.569**: :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
+- **00:03.118**: :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
+- **00:00.649**: :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
diff --git a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
index 0752e915b..b888a7d48 100644
--- a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
Computation times
=================
-**00:01.027** total execution time for **topic_vta_tutorials** files:
+**00:01.227** total execution time for **topic_vta_tutorials** files:
-- **00:00.521**: :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
-- **00:00.506**: :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``)
+- **00:00.640**: :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
+- **00:00.587**: :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``)
diff --git a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
index c59d4f4e2..4e9a20f68 100644
--- a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
@@ -306,7 +306,7 @@ We build the binary and check its correctness and performance.
.. code-block:: none
- Execution time of this operator: 94.070 ms
+ Execution time of this operator: 93.826 ms
@@ -415,6 +415,11 @@ Expression (TE) language that demonstrates how TVM can optimize computational
operations.
+.. rst-class:: sphx-glr-timing
+
+ **Total running time of the script:** ( 1 minutes 10.040 seconds)
+
+
.. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:
diff --git a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
index 92f4b317b..fc26547d0 100644
--- a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
@@ -280,7 +280,7 @@ standard deviation.
.. code-block:: none
- {'mean': 497.63291592001224, 'median': 497.3526272499839, 'std': 0.8516730647935751}
+ {'mean': 498.0128790300296, 'median': 498.0869522500143, 'std': 0.4942183791506082}
@@ -494,31 +494,31 @@ the tuning data to.
.. code-block:: none
-
[Task 1/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 1/25] Current/Best: 17.35/ 17.35 GFLOPS | Progress: (4/20) | 5.69 s
[Task 1/25] Current/Best: 6.15/ 17.35 GFLOPS | Progress: (8/20) | 9.18 s
[Task 1/25] Current/Best: 11.50/ 22.55 GFLOPS | Progress: (12/20) | 11.63 s
[Task 1/25] Current/Best: 16.75/ 22.57 GFLOPS | Progress: (16/20) | 13.33 s
[Task 1/25] Current/Best: 11.58/ 23.92 GFLOPS | Progress: (20/20) | 15.07 s Done.
-
[Task 2/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 2/25] Current/Best: 12.10/ 13.09 GFLOPS | Progress: (4/20) | 3.65 s
[Task 2/25] Current/Best: 14.40/ 17.82 GFLOPS | Progress: (8/20) | 5.00 s
[Task 2/25] Current/Best: 21.15/ 21.15 GFLOPS | Progress: (12/20) | 6.35 s
[Task 2/25] Current/Best: 12.28/ 21.15 GFLOPS | Progress: (16/20) | 7.62 s
[Task 2/25] Current/Best: 19.75/ 21.15 GFLOPS | Progress: (20/20) | 9.18 s Done.
-
[Task 3/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 3/25] Current/Best: 1.63/ 10.51 GFLOPS | Progress: (4/20) | 5.85 s
[Task 3/25] Current/Best: 15.50/ 16.83 GFLOPS | Progress: (8/20) | 7.78 s
[Task 3/25] Current/Best: 14.92/ 16.83 GFLOPS | Progress: (12/20) | 9.53 s
[Task 3/25] Current/Best: 7.20/ 23.79 GFLOPS | Progress: (16/20) | 11.43 s
[Task 3/25] Current/Best: 12.61/ 23.79 GFLOPS | Progress: (20/20) | 15.97 s Done.
-
[Task 4/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 4/25] Current/Best: 9.47/ 20.36 GFLOPS | Progress: (4/20) | 2.36 s
[Task 4/25] Current/Best: 6.56/ 20.36 GFLOPS | Progress: (8/20) | 6.73 s
[Task 4/25] Current/Best: 21.41/ 21.41 GFLOPS | Progress: (12/20) | 11.26 s
[Task 4/25] Current/Best: 16.59/ 21.41 GFLOPS | Progress: (16/20) | 13.50 s
[Task 4/25] Current/Best: 13.12/ 21.41 GFLOPS | Progress: (20/20) | 15.40 s Done.
-
[Task 5/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 5/25] Current/Best: 9.57/ 10.47 GFLOPS | Progress: (4/20) | 2.57 s
[Task 5/25] Current/Best: 11.79/ 13.10 GFLOPS | Progress: (8/20) | 4.61 s
[Task 5/25] Current/Best: 10.28/ 18.23 GFLOPS | Progress: (12/20) | 7.57 s
[Task 5/25] Current/Best: 11.87/ 22.68 GFLOPS | Progress: (16/20) | 8.98 s
[Task 5/25] Current/Best: 11.12/ 22.68 GFLOPS | Progress: (20/20) | 10.89 s Done.
-
[Task 6/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 6/25] Current/Best: 12.20/ 20.75 GFLOPS | Progress: (4/20) | 3.93 s
[Task 6/25] Current/Best: 18.92/ 20.75 GFLOPS | Progress: (8/20) | 5.69 s
[Task 6/25] Current/Best: 13.24/ 20.75 GFLOPS | Progress: (12/20) | 7.60 s
[Task 6/25] Current/Best: 19.85/ 20.75 GFLOPS | Progress: (16/20) | 9.83 s
[Task 6/25] Current/Best: 3.74/ 20.75 GFLOPS | Progress: (20/20) | 12.33 s Done.
-
[Task 7/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 7/25] Current/Best: 11.16/ 12.70 GFLOPS | Progress: (4/20) | 3.63 s
[Task 7/25] Current/Best: 20.20/ 20.82 GFLOPS | Progress: (8/20) | 5.15 s
[Task 7/25] Current/Best: 15.87/ 20.82 GFLOPS | Progress: (12/20) | 7.08 s
[Task 7/25] Current/Best: 12.28/ 20.82 GFLOPS | Progress: (16/20) | 9.13 s
[Task 7/25] Current/Best: 6.39/ 21.62 GFLOPS | Progress: (20/20) | 11.58 s Done.
-
[Task 8/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 8/25] Current/Best: 10.02/ 14.57 GFLOPS | Progress: (4/20) | 2.90 s
[Task 8/25] Current/Best: 9.94/ 14.57 GFLOPS | Progress: (8/20) | 7.67 s
[Task 8/25] Current/Best: 13.14/ 14.57 GFLOPS | Progress: (12/20) | 13.85 s
[Task 8/25] Current/Best: 18.86/ 18.86 GFLOPS | Progress: (16/20) | 15.91 s
[Task 8/25] Current/Best: 20.09/ 20.09 GFLOPS | Progress: (20/20) | 22.37 s Done.
-
[Task 9/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 9/25] Current/Best: 14.21/ 15.41 GFLOPS | Progress: (4/20) | 11.93 s
[Task 9/25] Current/Best: 22.85/ 22.85 GFLOPS | Progress: (8/20) | 13.70 s
[Task 9/25] Current/Best: 8.18/ 22.85 GFLOPS | Progress: (12/20) | 16.05 s
[Task 9/25] Current/Best: 17.60/ 22.85 GFLOPS | Progress: (16/20) | 18.63 s
[Task 9/25] Current/Best: 8.92/ 22.85 GFLOPS | Progress: (20/20) | 26.26 s
[Task 10/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 10/25] Current/Best: 18.37/ 18.37 GFLOPS | Progress: (4/20) | 2.56 s
[Task 10/25] Current/Best: 15.54/ 18.37 GFLOPS | Progress: (8/20) | 4.14 s
[Task 10/25] Current/Best: 13.09/ 19.11 GFLOPS | Progress: (12/20) | 5.67 s
[Task 10/25] Current/Best: 18.96/ 20.43 GFLOPS | Progress: (16/20) | 6.79 s
[Task 10/25] Current/Best: 8.88/ 20.43 GFLOPS | Progress: (20/20) | 8.32 s Done.
-
[Task 11/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 11/25] Current/Best: 12.27/ 18.08 GFLOPS | Progress: (4/20) | 3.31 s
[Task 11/25] Current/Best: 16.86/ 18.08 GFLOPS | Progress: (8/20) | 6.03 s
[Task 11/25] Current/Best: 18.10/ 18.10 GFLOPS | Progress: (12/20) | 8.05 s
[Task 11/25] Current/Best: 13.33/ 21.13 GFLOPS | Progress: (16/20) | 10.84 s
[Task 11/25] Current/Best: 19.47/ 21.52 GFLOPS | Progress: (20/20) | 12.87 s Done.
-
[Task 12/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 12/25] Current/Best: 7.80/ 18.11 GFLOPS | Progress: (4/20) | 5.40 s
[Task 12/25] Current/Best: 5.29/ 18.11 GFLOPS | Progress: (8/20) | 9.10 s
[Task 12/25] Current/Best: 19.07/ 19.13 GFLOPS | Progress: (12/20) | 11.08 s
[Task 12/25] Current/Best: 14.84/ 19.13 GFLOPS | Progress: (16/20) | 13.87 s
[Task 12/25] Current/Best: 15.13/ 19.13 GFLOPS | Progress: (20/20) | 15.78 s Done.
-
[Task 13/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 13/25] Current/Best: 8.35/ 17.25 GFLOPS | Progress: (4/20) | 3.67 s
[Task 13/25] Current/Best: 15.85/ 20.81 GFLOPS | Progress: (8/20) | 6.10 s
[Task 13/25] Current/Best: 19.53/ 21.84 GFLOPS | Progress: (12/20) | 9.04 s
[Task 13/25] Current/Best: 12.23/ 21.84 GFLOPS | Progress: (16/20) | 12.42 s
[Task 13/25] Current/Best: 18.63/ 21.84 GFLOPS | Progress: (20/20) | 14.71 s Done.
-
[Task 14/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 14/25] Current/Best: 13.57/ 13.57 GFLOPS | Progress: (4/20) | 3.24 s
[Task 14/25] Current/Best: 6.08/ 13.57 GFLOPS | Progress: (8/20) | 5.40 s
[Task 14/25] Current/Best: 20.38/ 20.38 GFLOPS | Progress: (12/20) | 7.97 s
[Task 14/25] Current/Best: 17.06/ 20.38 GFLOPS | Progress: (16/20) | 9.64 s Done.
-
[Task 14/25] Current/Best: 17.00/ 20.38 GFLOPS | Progress: (20/20) | 11.37 s
[Task 15/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 15/25] Current/Best: 14.92/ 17.67 GFLOPS | Progress: (4/20) | 2.69 s
[Task 15/25] Current/Best: 14.34/ 18.00 GFLOPS | Progress: (8/20) | 3.99 s
[Task 15/25] Current/Best: 10.27/ 22.32 GFLOPS | Progress: (12/20) | 6.07 s
[Task 15/25] Current/Best: 20.42/ 22.32 GFLOPS | Progress: (16/20) | 9.02 s
[Task 15/25] Current/Best: 9.69/ 22.32 GFLOPS | Progress: (20/20) | 10.00 s
[Task 16/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 16/25] Current/Best: 20.57/ 20.57 GFLOPS | Progress: (4/20) | 2.92 s
[Task 16/25] Current/Best: 3.04/ 20.57 GFLOPS | Progress: (8/20) | 4.53 s
[Task 16/25] Current/Best: 19.64/ 20.57 GFLOPS | Progress: (12/20) | 5.77 s
[Task 16/25] Current/Best: 17.83/ 20.57 GFLOPS | Progress: (16/20) | 7.10 s
[Task 16/25] Current/Best: 9.97/ 22.58 GFLOPS | Progress: (20/20) | 9.14 s Done.
-
[Task 17/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 17/25] Current/Best: 13.09/ 18.72 GFLOPS | Progress: (4/20) | 4.67 s
[Task 17/25] Current/Best: 14.39/ 22.62 GFLOPS | Progress: (8/20) | 7.56 s
[Task 17/25] Current/Best: 16.74/ 22.62 GFLOPS | Progress: (12/20) | 9.62 s
[Task 17/25] Current/Best: 16.43/ 22.62 GFLOPS | Progress: (16/20) | 11.76 s
[Task 17/25] Current/Best: 10.01/ 22.62 GFLOPS | Progress: (20/20) | 13.90 s Done.
-
[Task 18/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 18/25] Current/Best: 11.40/ 17.97 GFLOPS | Progress: (4/20) | 3.67 s
[Task 18/25] Current/Best: 10.51/ 20.10 GFLOPS | Progress: (8/20) | 7.13 s
[Task 18/25] Current/Best: 18.96/ 20.10 GFLOPS | Progress: (12/20) | 9.05 s
[Task 18/25] Current/Best: 9.95/ 20.10 GFLOPS | Progress: (16/20) | 12.61 s
[Task 18/25] Current/Best: 20.71/ 20.71 GFLOPS | Progress: (20/20) | 14.12 s Done.
-
[Task 19/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 19/25] Current/Best: 6.16/ 20.24 GFLOPS | Progress: (4/20) | 6.16 s
[Task 19/25] Current/Best: 2.60/ 20.24 GFLOPS | Progress: (8/20) | 9.41 s
[Task 19/25] Current/Best: 19.24/ 20.84 GFLOPS | Progress: (12/20) | 12.17 s
[Task 19/25] Current/Best: 15.28/ 20.92 GFLOPS | Progress: (16/20) | 15.02 s
[Task 19/25] Current/Best: 2.70/ 23.07 GFLOPS | Progress: (20/20) | 17.79 s Done.
-
[Task 20/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 20/25] Current/Best: 9.11/ 14.96 GFLOPS | Progress: (4/20) | 3.35 s Done.
+
[Task 1/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 1/25] Current/Best: 17.43/ 17.43 GFLOPS | Progress: (4/20) | 6.21 s
[Task 1/25] Current/Best: 6.15/ 17.43 GFLOPS | Progress: (8/20) | 9.21 s
[Task 1/25] Current/Best: 11.52/ 22.59 GFLOPS | Progress: (12/20) | 11.68 s
[Task 1/25] Current/Best: 16.75/ 22.65 GFLOPS | Progress: (16/20) | 13.38 s
[Task 1/25] Current/Best: 11.56/ 23.83 GFLOPS | Progress: (20/20) | 15.13 s Done.
+
[Task 2/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 2/25] Current/Best: 12.15/ 12.94 GFLOPS | Progress: (4/20) | 3.79 s
[Task 2/25] Current/Best: 13.87/ 18.26 GFLOPS | Progress: (8/20) | 5.09 s
[Task 2/25] Current/Best: 20.95/ 20.95 GFLOPS | Progress: (12/20) | 6.41 s
[Task 2/25] Current/Best: 12.23/ 20.95 GFLOPS | Progress: (16/20) | 7.70 s
[Task 2/25] Current/Best: 19.49/ 20.95 GFLOPS | Progress: (20/20) | 9.32 s Done.
+
[Task 3/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 3/25] Current/Best: 1.63/ 10.54 GFLOPS | Progress: (4/20) | 5.85 s
[Task 3/25] Current/Best: 15.47/ 16.82 GFLOPS | Progress: (8/20) | 7.79 s
[Task 3/25] Current/Best: 14.84/ 16.82 GFLOPS | Progress: (12/20) | 9.51 s
[Task 3/25] Current/Best: 7.13/ 23.47 GFLOPS | Progress: (16/20) | 11.42 s
[Task 3/25] Current/Best: 12.47/ 23.47 GFLOPS | Progress: (20/20) | 15.99 s Done.
+
[Task 4/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 4/25] Current/Best: 9.52/ 20.31 GFLOPS | Progress: (4/20) | 2.36 s
[Task 4/25] Current/Best: 6.58/ 20.31 GFLOPS | Progress: (8/20) | 6.76 s
[Task 4/25] Current/Best: 21.93/ 21.93 GFLOPS | Progress: (12/20) | 11.38 s
[Task 4/25] Current/Best: 16.69/ 21.93 GFLOPS | Progress: (16/20) | 13.63 s
[Task 4/25] Current/Best: 13.30/ 21.93 GFLOPS | Progress: (20/20) | 15.67 s Done.
+
[Task 5/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 5/25] Current/Best: 9.64/ 10.17 GFLOPS | Progress: (4/20) | 2.57 s
[Task 5/25] Current/Best: 11.74/ 12.01 GFLOPS | Progress: (8/20) | 4.64 s
[Task 5/25] Current/Best: 10.37/ 18.09 GFLOPS | Progress: (12/20) | 7.77 s
[Task 5/25] Current/Best: 11.68/ 22.63 GFLOPS | Progress: (16/20) | 9.21 s
[Task 5/25] Current/Best: 11.70/ 22.63 GFLOPS | Progress: (20/20) | 11.08 s Done.
+
[Task 6/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 6/25] Current/Best: 12.27/ 20.65 GFLOPS | Progress: (4/20) | 3.97 s
[Task 6/25] Current/Best: 18.94/ 20.65 GFLOPS | Progress: (8/20) | 5.74 s
[Task 6/25] Current/Best: 13.25/ 20.65 GFLOPS | Progress: (12/20) | 7.68 s
[Task 6/25] Current/Best: 19.81/ 20.65 GFLOPS | Progress: (16/20) | 9.97 s
[Task 6/25] Current/Best: 3.76/ 20.65 GFLOPS | Progress: (20/20) | 12.51 s Done.
+
[Task 7/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 7/25] Current/Best: 10.12/ 12.16 GFLOPS | Progress: (4/20) | 3.57 s
[Task 7/25] Current/Best: 20.07/ 21.02 GFLOPS | Progress: (8/20) | 5.09 s
[Task 7/25] Current/Best: 15.87/ 21.02 GFLOPS | Progress: (12/20) | 7.02 s
[Task 7/25] Current/Best: 12.20/ 21.02 GFLOPS | Progress: (16/20) | 9.08 s
[Task 7/25] Current/Best: 6.31/ 21.59 GFLOPS | Progress: (20/20) | 11.57 s Done.
+
[Task 8/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 8/25] Current/Best: 10.20/ 14.15 GFLOPS | Progress: (4/20) | 2.90 s
[Task 8/25] Current/Best: 9.44/ 14.15 GFLOPS | Progress: (8/20) | 7.73 s
[Task 8/25] Current/Best: 13.09/ 14.15 GFLOPS | Progress: (12/20) | 13.92 s
[Task 8/25] Current/Best: 18.78/ 18.78 GFLOPS | Progress: (16/20) | 16.01 s
[Task 8/25] Current/Best: 19.50/ 19.50 GFLOPS | Progress: (20/20) | 22.57 s Done.
+
[Task 9/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 9/25] Current/Best: 14.28/ 15.61 GFLOPS | Progress: (4/20) | 11.95 s
[Task 9/25] Current/Best: 23.31/ 23.31 GFLOPS | Progress: (8/20) | 13.74 s
[Task 9/25] Current/Best: 8.25/ 23.31 GFLOPS | Progress: (12/20) | 16.11 s
[Task 9/25] Current/Best: 17.93/ 23.31 GFLOPS | Progress: (16/20) | 18.81 s
[Task 9/25] Current/Best: 8.96/ 23.31 GFLOPS | Progress: (20/20) | 26.58 s
[Task 10/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 10/25] Current/Best: 18.44/ 18.44 GFLOPS | Progress: (4/20) | 2.55 s
[Task 10/25] Current/Best: 15.56/ 18.44 GFLOPS | Progress: (8/20) | 4.19 s
[Task 10/25] Current/Best: 12.88/ 19.01 GFLOPS | Progress: (12/20) | 5.72 s
[Task 10/25] Current/Best: 19.11/ 20.31 GFLOPS | Progress: (16/20) | 6.84 s
[Task 10/25] Current/Best: 8.89/ 20.31 GFLOPS | Progress: (20/20) | 8.38 s Done.
+
[Task 11/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 11/25] Current/Best: 11.64/ 18.03 GFLOPS | Progress: (4/20) | 3.34 s
[Task 11/25] Current/Best: 14.92/ 18.03 GFLOPS | Progress: (8/20) | 6.11 s
[Task 11/25] Current/Best: 16.23/ 18.03 GFLOPS | Progress: (12/20) | 8.16 s
[Task 11/25] Current/Best: 13.45/ 21.15 GFLOPS | Progress: (16/20) | 11.03 s
[Task 11/25] Current/Best: 19.44/ 21.56 GFLOPS | Progress: (20/20) | 13.05 s Done.
+
[Task 12/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 12/25] Current/Best: 7.78/ 18.23 GFLOPS | Progress: (4/20) | 5.35 s
[Task 12/25] Current/Best: 5.23/ 18.23 GFLOPS | Progress: (8/20) | 9.07 s
[Task 12/25] Current/Best: 19.04/ 19.04 GFLOPS | Progress: (12/20) | 11.04 s
[Task 12/25] Current/Best: 15.37/ 19.04 GFLOPS | Progress: (16/20) | 13.83 s
[Task 12/25] Current/Best: 15.16/ 19.04 GFLOPS | Progress: (20/20) | 15.76 s Done.
+
[Task 13/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 13/25] Current/Best: 8.73/ 17.31 GFLOPS | Progress: (4/20) | 3.62 s
[Task 13/25] Current/Best: 15.76/ 20.75 GFLOPS | Progress: (8/20) | 6.09 s
[Task 13/25] Current/Best: 19.56/ 21.64 GFLOPS | Progress: (12/20) | 9.01 s
[Task 13/25] Current/Best: 12.23/ 21.64 GFLOPS | Progress: (16/20) | 12.44 s
[Task 13/25] Current/Best: 18.61/ 21.64 GFLOPS | Progress: (20/20) | 14.69 s Done.
+
[Task 14/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 14/25] Current/Best: 13.52/ 13.52 GFLOPS | Progress: (4/20) | 3.32 s
[Task 14/25] Current/Best: 6.12/ 13.52 GFLOPS | Progress: (8/20) | 5.51 s
[Task 14/25] Current/Best: 20.56/ 20.56 GFLOPS | Progress: (12/20) | 8.05 s
[Task 14/25] Current/Best: 16.75/ 20.56 GFLOPS | Progress: (16/20) | 9.73 s
[Task 14/25] Current/Best: 17.23/ 20.56 GFLOPS | Progress: (20/20) | 11.48 s
[Task 15/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s Done.
Done.
-
[Task 20/25] Current/Best: 10.12/ 14.96 GFLOPS | Progress: (8/20) | 6.65 s
[Task 20/25] Current/Best: 2.32/ 16.81 GFLOPS | Progress: (12/20) | 10.62 s
[Task 20/25] Current/Best: 12.56/ 16.81 GFLOPS | Progress: (16/20) | 14.18 s
[Task 20/25] Current/Best: 13.18/ 21.64 GFLOPS | Progress: (20/20) | 16.27 s
[Task 21/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 21/25] Current/Best: 6.34/ 17.73 GFLOPS | Progress: (4/20) | 3.22 s
[Task 21/25] Current/Best: 14.37/ 17.73 GFLOPS | Progress: (8/20) | 4.78 s
[Task 21/25] Current/Best: 1.61/ 17.73 GFLOPS | Progress: (12/20) | 6.91 s
[Task 21/25] Current/Best: 17.98/ 17.98 GFLOPS | Progress: (16/20) | 10.35 s
[Task 21/25] Current/Best: 4.46/ 17.98 GFLOPS | Progress: (20/20) | 17.55 s
[Task 22/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 22/25] Current/Best: 2.70/ 16.82 GFLOPS | Progress: (4/20) | 2.68 s
[Task 22/25] Current/Best: 8.76/ 21.13 GFLOPS | Progress: (8/20) | 4.66 s
[Task 22/25] Current/Best: 19.93/ 21.13 GFLOPS | Progress: (12/20) | 6.99 s
[Task 22/25] Current/Best: 15.06/ 21.13 GFLOPS | Progress: (16/20) | 9.05 s
[Task 22/25] Current/Best: 15.18/ 21.13 GFLOPS | Progress: (20/20) | 10.73 s Done.
-
[Task 23/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 23/25] Current/Best: 17.40/ 20.21 GFLOPS | Progress: (4/20) | 3.23 s
[Task 23/25] Current/Best: 15.68/ 20.21 GFLOPS | Progress: (8/20) | 6.59 s
[Task 23/25] Current/Best: 20.80/ 21.32 GFLOPS | Progress: (12/20) | 8.41 s
[Task 23/25] Current/Best: 6.13/ 21.32 GFLOPS | Progress: (16/20) | 15.53 s
[Task 23/25] Current/Best: 7.54/ 21.32 GFLOPS | Progress: (20/20) | 19.77 s Done.
-
[Task 24/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 24/25] Current/Best: 8.61/ 8.61 GFLOPS | Progress: (4/20) | 11.81 s
[Task 24/25] Current/Best: 1.94/ 8.61 GFLOPS | Progress: (8/20) | 22.87 s
[Task 24/25] Current/Best: 4.48/ 8.61 GFLOPS | Progress: (12/20) | 34.36 s Done.
- Done.
-
[Task 24/25] Current/Best: 7.19/ 8.83 GFLOPS | Progress: (16/20) | 39.86 s
[Task 24/25] Current/Best: 3.30/ 8.83 GFLOPS | Progress: (20/20) | 45.75 s Done.
-
[Task 25/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 25/25] Current/Best: 1.55/ 2.92 GFLOPS | Progress: (4/20) | 11.59 s
[Task 25/25] Current/Best: 5.86/ 7.98 GFLOPS | Progress: (8/20) | 22.81 s
[Task 25/25] Current/Best: 5.97/ 7.98 GFLOPS | Progress: (12/20) | 34.12 s
[Task 25/25] Current/Best: 5.81/ 9.45 GFLOPS | Progress: (16/20) | 35.98 s
[Task 25/25] Current/Best: 2.95/ 9.45 GFLOPS | Progress: (20/20) | 46.64 s
+
[Task 15/25] Current/Best: 16.12/ 17.63 GFLOPS | Progress: (4/20) | 2.66 s
[Task 15/25] Current/Best: 13.08/ 18.10 GFLOPS | Progress: (8/20) | 4.00 s
[Task 15/25] Current/Best: 10.36/ 22.03 GFLOPS | Progress: (12/20) | 6.11 s
[Task 15/25] Current/Best: 20.41/ 22.03 GFLOPS | Progress: (16/20) | 9.17 s
[Task 15/25] Current/Best: 9.69/ 22.03 GFLOPS | Progress: (20/20) | 10.19 s
[Task 16/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 16/25] Current/Best: 19.67/ 19.67 GFLOPS | Progress: (4/20) | 3.00 s
[Task 16/25] Current/Best: 3.04/ 19.67 GFLOPS | Progress: (8/20) | 4.65 s
[Task 16/25] Current/Best: 19.60/ 19.67 GFLOPS | Progress: (12/20) | 5.87 s
[Task 16/25] Current/Best: 17.39/ 19.67 GFLOPS | Progress: (16/20) | 7.23 s
[Task 16/25] Current/Best: 10.01/ 22.17 GFLOPS | Progress: (20/20) | 9.27 s Done.
+
[Task 17/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 17/25] Current/Best: 13.43/ 17.41 GFLOPS | Progress: (4/20) | 4.68 s
[Task 17/25] Current/Best: 12.79/ 23.30 GFLOPS | Progress: (8/20) | 7.56 s
[Task 17/25] Current/Best: 16.74/ 23.30 GFLOPS | Progress: (12/20) | 9.62 s
[Task 17/25] Current/Best: 16.51/ 23.30 GFLOPS | Progress: (16/20) | 11.77 s
[Task 17/25] Current/Best: 10.03/ 23.30 GFLOPS | Progress: (20/20) | 13.90 s Done.
+
[Task 18/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 18/25] Current/Best: 11.28/ 18.09 GFLOPS | Progress: (4/20) | 3.69 s
[Task 18/25] Current/Best: 10.55/ 19.95 GFLOPS | Progress: (8/20) | 7.13 s
[Task 18/25] Current/Best: 19.43/ 19.95 GFLOPS | Progress: (12/20) | 9.07 s
[Task 18/25] Current/Best: 9.85/ 19.95 GFLOPS | Progress: (16/20) | 12.67 s
[Task 18/25] Current/Best: 20.49/ 20.49 GFLOPS | Progress: (20/20) | 14.20 s Done.
+
[Task 19/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 19/25] Current/Best: 6.65/ 20.29 GFLOPS | Progress: (4/20) | 6.11 s
[Task 19/25] Current/Best: 2.61/ 20.29 GFLOPS | Progress: (8/20) | 9.40 s
[Task 19/25] Current/Best: 19.25/ 21.15 GFLOPS | Progress: (12/20) | 12.17 s
[Task 19/25] Current/Best: 15.07/ 21.15 GFLOPS | Progress: (16/20) | 15.01 s
[Task 19/25] Current/Best: 2.70/ 23.36 GFLOPS | Progress: (20/20) | 17.80 s Done.
+
[Task 20/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 20/25] Current/Best: 9.50/ 15.16 GFLOPS | Progress: (4/20) | 3.29 s
[Task 20/25] Current/Best: 10.22/ 15.16 GFLOPS | Progress: (8/20) | 6.59 s
[Task 20/25] Current/Best: 2.32/ 16.72 GFLOPS | Progress: (12/20) | 10.61 s Done.
+
[Task 20/25] Current/Best: 12.33/ 16.72 GFLOPS | Progress: (16/20) | 14.40 s
[Task 20/25] Current/Best: 13.39/ 21.97 GFLOPS | Progress: (20/20) | 16.51 s Done.
+
[Task 21/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 21/25] Current/Best: 6.40/ 17.65 GFLOPS | Progress: (4/20) | 3.20 s
[Task 21/25] Current/Best: 14.52/ 17.65 GFLOPS | Progress: (8/20) | 4.74 s
[Task 21/25] Current/Best: 1.61/ 17.65 GFLOPS | Progress: (12/20) | 6.89 s
[Task 21/25] Current/Best: 18.22/ 18.22 GFLOPS | Progress: (16/20) | 10.32 s
[Task 21/25] Current/Best: 4.46/ 18.22 GFLOPS | Progress: (20/20) | 17.44 s
[Task 22/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 22/25] Current/Best: 2.70/ 16.88 GFLOPS | Progress: (4/20) | 2.68 s
[Task 22/25] Current/Best: 9.07/ 21.56 GFLOPS | Progress: (8/20) | 4.65 s
[Task 22/25] Current/Best: 19.70/ 21.56 GFLOPS | Progress: (12/20) | 7.00 s
[Task 22/25] Current/Best: 15.13/ 21.56 GFLOPS | Progress: (16/20) | 9.05 s
[Task 22/25] Current/Best: 15.20/ 21.56 GFLOPS | Progress: (20/20) | 10.72 s Done.
+
[Task 23/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 23/25] Current/Best: 17.42/ 20.33 GFLOPS | Progress: (4/20) | 3.21 s
[Task 23/25] Current/Best: 15.71/ 20.33 GFLOPS | Progress: (8/20) | 6.60 s
[Task 23/25] Current/Best: 20.87/ 21.44 GFLOPS | Progress: (12/20) | 8.42 s
[Task 23/25] Current/Best: 6.17/ 21.44 GFLOPS | Progress: (16/20) | 15.64 s
[Task 23/25] Current/Best: 7.58/ 21.44 GFLOPS | Progress: (20/20) | 19.87 s Done.
+
[Task 24/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 24/25] Current/Best: 8.71/ 8.71 GFLOPS | Progress: (4/20) | 11.79 s
[Task 24/25] Current/Best: 2.07/ 8.71 GFLOPS | Progress: (8/20) | 22.83 s
[Task 24/25] Current/Best: 4.13/ 8.71 GFLOPS | Progress: (12/20) | 34.35 s
[Task 24/25] Current/Best: 6.81/ 8.71 GFLOPS | Progress: (16/20) | 39.87 s Done.
+
[Task 24/25] Current/Best: 3.23/ 8.98 GFLOPS | Progress: (20/20) | 45.85 s Done.
+
[Task 25/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/20) | 0.00 s
[Task 25/25] Current/Best: 1.54/ 2.95 GFLOPS | Progress: (4/20) | 11.58 s
[Task 25/25] Current/Best: 5.81/ 7.76 GFLOPS | Progress: (8/20) | 22.89 s
[Task 25/25] Current/Best: 5.84/ 7.76 GFLOPS | Progress: (12/20) | 34.31 s
[Task 25/25] Current/Best: 5.86/ 9.44 GFLOPS | Progress: (16/20) | 36.03 s
[Task 25/25] Current/Best: 2.92/ 9.44 GFLOPS | Progress: (20/20) | 46.74 s
The output from this tuning process will look something like this:
@@ -660,8 +660,8 @@ improvement in comparing the optimized model to the unoptimized model.
.. code-block:: none
- optimized: {'mean': 411.34827297999436, 'median': 411.4683489499839, 'std': 0.7163906572280575}
- unoptimized: {'mean': 497.63291592001224, 'median': 497.3526272499839, 'std': 0.8516730647935751}
+ optimized: {'mean': 413.9322347700363, 'median': 413.58360550002544, 'std': 0.7686886842152099}
+ unoptimized: {'mean': 498.0128790300296, 'median': 498.0869522500143, 'std': 0.4942183791506082}
@@ -681,7 +681,7 @@ profiling/benchmarking.
.. rst-class:: sphx-glr-timing
- **Total running time of the script:** ( 10 minutes 15.922 seconds)
+ **Total running time of the script:** ( 10 minutes 18.682 seconds)
.. _sphx_glr_download_tutorial_autotvm_relay_x86.py:
diff --git a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
index ae76ec7b9..b066b2f33 100644
--- a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
+++ b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
@@ -235,7 +235,7 @@ device and returns the measured cost. Network overhead is excluded.
.. code-block:: none
- 1.274e-07 secs/op
+ 1.286e-07 secs/op
diff --git a/docs/_sources/tutorial/intro_topi.rst.txt b/docs/_sources/tutorial/intro_topi.rst.txt
index ac5ae91f9..daa56ba65 100644
--- a/docs/_sources/tutorial/intro_topi.rst.txt
+++ b/docs/_sources/tutorial/intro_topi.rst.txt
@@ -232,7 +232,7 @@ As you can see, scheduled stages of computation have been accumulated and we can
.. code-block:: none
- [stage(a, placeholder(a, 0x26fef2b0)), stage(b, placeholder(b, 0x23a53050)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(mi [...]
+ [stage(a, placeholder(a, 0x203e1fe0)), stage(b, placeholder(b, 0x2125fe60)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(mi [...]
diff --git a/docs/_sources/tutorial/sg_execution_times.rst.txt b/docs/_sources/tutorial/sg_execution_times.rst.txt
index 1844625da..c218f812b 100644
--- a/docs/_sources/tutorial/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorial/sg_execution_times.rst.txt
@@ -5,17 +5,17 @@
Computation times
=================
-**13:08.061** total execution time for **tutorial** files:
+**13:24.053** total execution time for **tutorial** files:
-- **10:15.922**: :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)
-- **00:59.038**: :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)
-- **00:57.957**: :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``)
-- **00:28.568**: :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)
-- **00:24.751**: :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)
-- **00:00.743**: :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)
-- **00:00.651**: :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)
-- **00:00.232**: :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
-- **00:00.051**: :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)
-- **00:00.051**: :ref:`sphx_glr_tutorial_install.py` (``install.py``)
-- **00:00.048**: :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)
-- **00:00.047**: :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)
+- **10:18.682**: :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)
+- **01:10.040**: :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``)
+- **01:00.441**: :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)
+- **00:28.522**: :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)
+- **00:24.320**: :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)
+- **00:00.854**: :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)
+- **00:00.747**: :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)
+- **00:00.252**: :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
+- **00:00.050**: :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)
+- **00:00.049**: :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)
+- **00:00.049**: :ref:`sphx_glr_tutorial_install.py` (``install.py``)
+- **00:00.048**: :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)
diff --git a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
index 6194d0581..97d5bf2bc 100644
--- a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
+++ b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
@@ -253,7 +253,7 @@ helper function to run a profile of the TVM generated code.
.. code-block:: none
Numpy running time: 0.000008
- naive: 0.000006
+ naive: 0.000007
@@ -447,10 +447,10 @@ We can now compare the different schedules
.. code-block:: none
Operator Timing Performance
- numpy 8.123029992930242e-06 1.0
- naive 5.8526000000000005e-06 0.7204946928786086
- parallel 6.9821e-06 0.8595437916733986
- vector 2.4613099999999996e-05 3.0300392860080096
+ numpy 7.628059975104406e-06 1.0
+ naive 6.705600000000001e-06 0.879070172741821
+ parallel 7.0514e-06 0.9244028000584102
+ vector 2.45713e-05 3.2211728906423143
@@ -839,7 +839,7 @@ matrix multiplication.
.. code-block:: none
- Numpy running time: 0.018895
+ Numpy running time: 0.019087
@@ -897,7 +897,7 @@ optimizations.
/workspace/python/tvm/driver/build_module.py:264: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
"target_host parameter is going to be deprecated. "
- none: 3.234409
+ none: 3.350063
@@ -996,7 +996,7 @@ schedule.
.. code-block:: none
- blocking: 0.312457
+ blocking: 0.311091
@@ -1088,7 +1088,7 @@ already cache friendly from our previous optimizations.
.. code-block:: none
- vectorization: 0.341634
+ vectorization: 0.344023
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1160,7 +1160,7 @@ more cache friendly.
.. code-block:: none
- loop permutation: 0.122646
+ loop permutation: 0.123198
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1257,7 +1257,7 @@ optimized schedule.
.. code-block:: none
- array packing: 0.111257
+ array packing: 0.108706
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1348,7 +1348,7 @@ to `C` when all the block results are ready.
.. code-block:: none
- block caching: 0.113858
+ block caching: 0.110844
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1432,7 +1432,7 @@ of thread-level parallelization.
.. code-block:: none
- parallelization: 0.146408
+ parallelization: 0.145439
@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1511,13 +1511,13 @@ working, we can compare the results.
.. code-block:: none
Operator Timing Performance
- none 3.2344094218999997 1.0
- blocking 0.3124573388 0.09660413944022342
- vectorization 0.3416344174 0.10562497594980186
- loop permutation 0.1226462242 0.037919201993900184
- array packing 0.11125734849999999 0.03439804118386587
- block caching 0.1138582624 0.035202179918557085
- parallelization 0.1464084819 0.04526590879579951
+ none 3.3500625102 1.0
+ blocking 0.31109146179999997 0.09286139015400871
+ vectorization 0.3440227387 0.10269143863809924
+ loop permutation 0.1231980128 0.036774839999222896
+ array packing 0.1087057597 0.032448875019203814
+ block caching 0.1108440386 0.03308715531800109
+ parallelization 0.1454394921 0.043413963666999525
@@ -1552,6 +1552,11 @@ operations with tunable parameters that allows you to automatically optimize
the computation for specific platforms.
+.. rst-class:: sphx-glr-timing
+
+ **Total running time of the script:** ( 1 minutes 0.441 seconds)
+
+
.. _sphx_glr_download_tutorial_tensor_expr_get_started.py:
diff --git a/docs/commit_hash b/docs/commit_hash
index 36c5d17e6..e7e459734 100644
--- a/docs/commit_hash
+++ b/docs/commit_hash
@@ -1 +1 @@
-0df69611b2fb46724a0023dd8d389c9a1ecedcb8
+8f6543e9e6173cd45b678e91b5a637ff7f8e0e02
diff --git a/docs/how_to/compile_models/from_mxnet.html b/docs/how_to/compile_models/from_mxnet.html
index 5eacad5fc..14791c7f4 100644
--- a/docs/how_to/compile_models/from_mxnet.html
+++ b/docs/how_to/compile_models/from_mxnet.html
@@ -401,7 +401,7 @@
</div>
<img alt="../../_images/sphx_glr_from_mxnet_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_from_mxnet_001.png" />
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip4c98474d-b9a5-4b38-b0e2-b9ba0c921c4d from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip3aa5fed0-f0aa-4b27-b09c-8203092b35b5 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
x (1, 3, 224, 224)
</pre></div>
</div>
diff --git a/docs/how_to/compile_models/from_oneflow.html b/docs/how_to/compile_models/from_oneflow.html
index 6a312ae00..c057ef82a 100644
--- a/docs/how_to/compile_models/from_oneflow.html
+++ b/docs/how_to/compile_models/from_oneflow.html
@@ -406,65 +406,45 @@ python3 -m pip install -f https://release.oneflow.info <span class="nv">oneflow<
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
0%| | 0.00/41.5M [00:00<?, ?B/s]
- 0%| | 16.0k/41.5M [00:00<08:11, 88.4kB/s]
- 0%| | 48.0k/41.5M [00:00<05:10, 140kB/s]
- 0%| | 72.0k/41.5M [00:00<05:18, 136kB/s]
- 0%| | 160k/41.5M [00:00<02:38, 273kB/s]
- 1%| | 328k/41.5M [00:00<01:25, 508kB/s]
- 1%|1 | 512k/41.5M [00:01<01:03, 679kB/s]
- 2%|1 | 704k/41.5M [00:01<00:53, 802kB/s]
- 2%|2 | 904k/41.5M [00:01<00:47, 896kB/s]
- 3%|2 | 1.09M/41.5M [00:01<00:42, 987kB/s]
- 3%|3 | 1.31M/41.5M [00:01<00:39, 1.06MB/s]
- 4%|3 | 1.54M/41.5M [00:02<00:37, 1.13MB/s]
- 4%|4 | 1.78M/41.5M [00:02<00:34, 1.20MB/s]
- 5%|4 | 2.03M/41.5M [00:02<00:32, 1.26MB/s]
- 6%|5 | 2.30M/41.5M [00:02<00:30, 1.33MB/s]
- 6%|6 | 2.57M/41.5M [00:02<00:29, 1.39MB/s]
- 7%|6 | 2.87M/41.5M [00:02<00:27, 1.48MB/s]
- 8%|7 | 3.17M/41.5M [00:03<00:25, 1.55MB/s]
- 8%|8 | 3.49M/41.5M [00:03<00:24, 1.63MB/s]
- 9%|9 | 3.83M/41.5M [00:03<00:23, 1.70MB/s]
- 10%|# | 4.18M/41.5M [00:03<00:21, 1.79MB/s]
- 11%|# | 4.55M/41.5M [00:03<00:20, 1.87MB/s]
- 12%|#1 | 4.93M/41.5M [00:04<00:19, 1.96MB/s]
- 13%|#2 | 5.34M/41.5M [00:04<00:18, 2.05MB/s]
- 14%|#3 | 5.76M/41.5M [00:04<00:17, 2.15MB/s]
- 15%|#4 | 6.20M/41.5M [00:04<00:16, 2.26MB/s]
- 16%|#6 | 6.67M/41.5M [00:04<00:15, 2.37MB/s]
- 17%|#7 | 7.16M/41.5M [00:05<00:14, 2.49MB/s]
- 19%|#8 | 7.68M/41.5M [00:05<00:13, 2.61MB/s]
- 20%|#9 | 8.22M/41.5M [00:05<00:12, 2.74MB/s]
- 21%|##1 | 8.79M/41.5M [00:05<00:11, 2.88MB/s]
- 23%|##2 | 9.39M/41.5M [00:05<00:11, 3.03MB/s]
- 24%|##4 | 10.0M/41.5M [00:05<00:10, 3.18MB/s]
- 26%|##5 | 10.7M/41.5M [00:06<00:09, 3.33MB/s]
- 27%|##7 | 11.4M/41.5M [00:06<00:09, 3.49MB/s]
- 29%|##9 | 12.1M/41.5M [00:06<00:08, 3.67MB/s]
- 31%|### | 12.8M/41.5M [00:06<00:07, 3.85MB/s]
- 33%|###2 | 13.6M/41.5M [00:06<00:07, 4.04MB/s]
- 35%|###4 | 14.5M/41.5M [00:07<00:06, 4.24MB/s]
- 37%|###7 | 15.4M/41.5M [00:07<00:06, 4.44MB/s]
- 39%|###9 | 16.3M/41.5M [00:07<00:05, 4.66MB/s]
- 42%|####1 | 17.2M/41.5M [00:07<00:05, 4.88MB/s]
- 44%|####3 | 18.2M/41.5M [00:07<00:04, 5.13MB/s]
- 47%|####6 | 19.3M/41.5M [00:07<00:04, 5.63MB/s]
- 49%|####9 | 20.4M/41.5M [00:08<00:03, 5.83MB/s]
- 52%|#####2 | 21.6M/41.5M [00:08<00:03, 6.08MB/s]
- 55%|#####5 | 22.8M/41.5M [00:08<00:03, 6.33MB/s]
- 58%|#####8 | 24.1M/41.5M [00:08<00:02, 6.61MB/s]
- 61%|######1 | 25.5M/41.5M [00:08<00:02, 6.62MB/s]
- 65%|######4 | 26.9M/41.5M [00:09<00:02, 7.04MB/s]
- 68%|######8 | 28.4M/41.5M [00:09<00:01, 7.41MB/s]
- 72%|#######1 | 29.9M/41.5M [00:09<00:01, 7.66MB/s]
- 76%|#######5 | 31.3M/41.5M [00:09<00:01, 7.85MB/s]
- 79%|#######9 | 32.8M/41.5M [00:09<00:01, 7.99MB/s]
- 83%|########2 | 34.3M/41.5M [00:10<00:00, 8.06MB/s]
- 86%|########6 | 35.7M/41.5M [00:10<00:00, 8.12MB/s]
- 90%|########9 | 37.2M/41.5M [00:10<00:00, 8.15MB/s]
- 93%|#########3| 38.7M/41.5M [00:10<00:00, 8.20MB/s]
- 97%|#########6| 40.1M/41.5M [00:10<00:00, 8.22MB/s]
-100%|##########| 41.5M/41.5M [00:10<00:00, 4.02MB/s]
+ 0%| | 16.0k/41.5M [00:00<07:25, 97.6kB/s]
+ 0%| | 48.0k/41.5M [00:00<04:41, 154kB/s]
+ 0%| | 72.0k/41.5M [00:00<04:48, 150kB/s]
+ 0%| | 136k/41.5M [00:00<02:57, 244kB/s]
+ 1%| | 288k/41.5M [00:00<01:28, 489kB/s]
+ 1%|1 | 592k/41.5M [00:01<00:45, 950kB/s]
+ 3%|2 | 1.17M/41.5M [00:01<00:22, 1.84MB/s]
+ 6%|5 | 2.35M/41.5M [00:01<00:11, 3.59MB/s]
+ 9%|9 | 3.82M/41.5M [00:01<00:07, 5.32MB/s]
+ 13%|#2 | 5.29M/41.5M [00:01<00:05, 6.49MB/s]
+ 16%|#6 | 6.77M/41.5M [00:01<00:04, 7.31MB/s]
+ 20%|#9 | 8.23M/41.5M [00:02<00:04, 7.86MB/s]
+ 23%|##3 | 9.70M/41.5M [00:02<00:03, 8.80MB/s]
+ 27%|##6 | 11.1M/41.5M [00:02<00:03, 9.53MB/s]
+ 30%|### | 12.6M/41.5M [00:02<00:02, 10.8MB/s]
+ 33%|###2 | 13.7M/41.5M [00:02<00:02, 9.76MB/s]
+ 35%|###5 | 14.7M/41.5M [00:02<00:03, 8.57MB/s]
+ 38%|###7 | 15.6M/41.5M [00:02<00:03, 8.23MB/s]
+ 41%|####1 | 17.0M/41.5M [00:02<00:02, 9.71MB/s]
+ 43%|####3 | 18.0M/41.5M [00:03<00:02, 9.74MB/s]
+ 46%|####5 | 19.0M/41.5M [00:03<00:02, 8.42MB/s]
+ 48%|####8 | 20.0M/41.5M [00:03<00:02, 7.72MB/s]
+ 52%|#####1 | 21.4M/41.5M [00:03<00:02, 8.80MB/s]
+ 55%|#####5 | 22.9M/41.5M [00:03<00:01, 10.3MB/s]
+ 58%|#####7 | 23.9M/41.5M [00:03<00:01, 10.2MB/s]
+ 60%|###### | 25.0M/41.5M [00:03<00:01, 8.80MB/s]
+ 62%|######2 | 25.9M/41.5M [00:04<00:02, 7.75MB/s]
+ 66%|######5 | 27.3M/41.5M [00:04<00:01, 8.19MB/s]
+ 69%|######9 | 28.8M/41.5M [00:04<00:01, 8.49MB/s]
+ 73%|#######2 | 30.2M/41.5M [00:04<00:01, 9.69MB/s]
+ 76%|#######6 | 31.7M/41.5M [00:04<00:00, 10.9MB/s]
+ 79%|#######9 | 32.8M/41.5M [00:04<00:00, 10.3MB/s]
+ 82%|########1 | 33.8M/41.5M [00:04<00:00, 8.95MB/s]
+ 84%|########3 | 34.8M/41.5M [00:05<00:00, 7.91MB/s]
+ 87%|########7 | 36.1M/41.5M [00:05<00:00, 8.11MB/s]
+ 91%|######### | 37.6M/41.5M [00:05<00:00, 8.43MB/s]
+ 94%|#########4| 39.1M/41.5M [00:05<00:00, 8.66MB/s]
+ 98%|#########7| 40.5M/41.5M [00:05<00:00, 8.80MB/s]
+100%|##########| 41.5M/41.5M [00:05<00:00, 7.57MB/s]
</pre></div>
</div>
</div>
diff --git a/docs/how_to/compile_models/from_paddle.html b/docs/how_to/compile_models/from_paddle.html
index 5332ab95f..50b29de1a 100644
--- a/docs/how_to/compile_models/from_paddle.html
+++ b/docs/how_to/compile_models/from_paddle.html
@@ -469,7 +469,7 @@ A quick solution is</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>TVM prediction top-1 id: 282, class name: 282: 'tiger cat',
</pre></div>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 6.972 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 7.772 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-paddle-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/16269b77359771348d507395692524cf/from_paddle.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_paddle.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/from_pytorch.html b/docs/how_to/compile_models/from_pytorch.html
index 6eaa4f046..5c27ad628 100644
--- a/docs/how_to/compile_models/from_pytorch.html
+++ b/docs/how_to/compile_models/from_pytorch.html
@@ -387,24 +387,9 @@ be unstable.</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
0%| | 0.00/44.7M [00:00<?, ?B/s]
- 5%|5 | 2.38M/44.7M [00:00<00:01, 24.8MB/s]
- 11%|# | 4.81M/44.7M [00:00<00:01, 25.0MB/s]
- 17%|#6 | 7.38M/44.7M [00:00<00:01, 25.4MB/s]
- 22%|##1 | 9.81M/44.7M [00:00<00:01, 25.3MB/s]
- 28%|##7 | 12.3M/44.7M [00:00<00:01, 25.6MB/s]
- 33%|###3 | 14.8M/44.7M [00:00<00:01, 25.7MB/s]
- 39%|###8 | 17.3M/44.7M [00:00<00:01, 24.8MB/s]
- 45%|####4 | 20.0M/44.7M [00:00<00:01, 25.7MB/s]
- 50%|##### | 22.5M/44.7M [00:00<00:00, 25.7MB/s]
- 56%|#####5 | 25.0M/44.7M [00:01<00:00, 25.9MB/s]
- 62%|######1 | 27.5M/44.7M [00:01<00:00, 25.7MB/s]
- 67%|######7 | 29.9M/44.7M [00:01<00:00, 25.2MB/s]
- 73%|#######2 | 32.4M/44.7M [00:01<00:00, 25.5MB/s]
- 78%|#######8 | 35.0M/44.7M [00:01<00:00, 26.0MB/s]
- 84%|########4 | 37.5M/44.7M [00:01<00:00, 25.5MB/s]
- 90%|########9 | 40.2M/44.7M [00:01<00:00, 26.0MB/s]
- 96%|#########5| 42.7M/44.7M [00:01<00:00, 26.1MB/s]
-100%|##########| 44.7M/44.7M [00:01<00:00, 25.7MB/s]
+ 41%|#### | 18.3M/44.7M [00:00<00:00, 192MB/s]
+ 97%|#########6| 43.2M/44.7M [00:00<00:00, 233MB/s]
+100%|##########| 44.7M/44.7M [00:00<00:00, 227MB/s]
</pre></div>
</div>
</div>
diff --git a/docs/how_to/compile_models/from_tensorflow.html b/docs/how_to/compile_models/from_tensorflow.html
index 401b12d35..859e4832a 100644
--- a/docs/how_to/compile_models/from_tensorflow.html
+++ b/docs/how_to/compile_models/from_tensorflow.html
@@ -612,7 +612,7 @@ banana (score = 0.00022)
desk (score = 0.00019)
</pre></div>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 5.091 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 5.517 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-tensorflow-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_tensorflow.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/sg_execution_times.html b/docs/how_to/compile_models/sg_execution_times.html
index e47b8ff57..99e9c5fef 100644
--- a/docs/how_to/compile_models/sg_execution_times.html
+++ b/docs/how_to/compile_models/sg_execution_times.html
@@ -300,18 +300,18 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-compile-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:36.236</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
+<p><strong>05:33.413</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
<ul class="simple">
-<li><p><strong>01:06.972</strong>: <a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></li>
-<li><p><strong>01:05.091</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
-<li><p><strong>00:59.192</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
-<li><p><strong>00:36.839</strong>: <a class="reference internal" href="from_oneflow.html#sphx-glr-how-to-compile-models-from-oneflow-py"><span class="std std-ref">Compile OneFlow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_oneflow.py</span></code>)</p></li>
-<li><p><strong>00:24.334</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
-<li><p><strong>00:23.179</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
-<li><p><strong>00:22.183</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
-<li><p><strong>00:21.079</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
-<li><p><strong>00:14.402</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
-<li><p><strong>00:02.966</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
+<li><p><strong>01:07.772</strong>: <a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></li>
+<li><p><strong>01:05.517</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
+<li><p><strong>00:58.763</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
+<li><p><strong>00:32.928</strong>: <a class="reference internal" href="from_oneflow.html#sphx-glr-how-to-compile-models-from-oneflow-py"><span class="std std-ref">Compile OneFlow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_oneflow.py</span></code>)</p></li>
+<li><p><strong>00:24.649</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
+<li><p><strong>00:24.258</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
+<li><p><strong>00:22.451</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
+<li><p><strong>00:20.237</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
+<li><p><strong>00:14.307</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
+<li><p><strong>00:02.530</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/deploy_models/deploy_model_on_android.html b/docs/how_to/deploy_models/deploy_model_on_android.html
index ce2d71c7b..8709f0bd4 100644
--- a/docs/how_to/deploy_models/deploy_model_on_android.html
+++ b/docs/how_to/deploy_models/deploy_model_on_android.html
@@ -627,7 +627,7 @@ to the remote android device.</p>
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 16.3155 16.2747 16.7494 16.0559 0.2256
+ 16.5542 16.7218 17.0643 15.8826 0.4657
</pre></div>
</div>
</div>
diff --git a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
index 2a01ae0ab..a29cce5f8 100644
--- a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
+++ b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
@@ -409,14 +409,15 @@ be unstable.</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
0%| | 0.00/170M [00:00<?, ?B/s]
- 10%|# | 17.5M/170M [00:00<00:00, 183MB/s]
- 25%|##4 | 42.1M/170M [00:00<00:00, 227MB/s]
- 39%|###9 | 66.5M/170M [00:00<00:00, 240MB/s]
- 54%|#####3 | 91.0M/170M [00:00<00:00, 246MB/s]
- 67%|######7 | 115M/170M [00:00<00:00, 242MB/s]
- 82%|########1 | 139M/170M [00:00<00:00, 246MB/s]
- 96%|#########6| 163M/170M [00:00<00:00, 250MB/s]
-100%|##########| 170M/170M [00:00<00:00, 243MB/s]
+ 2%|1 | 2.59M/170M [00:00<00:06, 26.6MB/s]
+ 4%|3 | 6.02M/170M [00:00<00:05, 32.0MB/s]
+ 16%|#6 | 27.4M/170M [00:00<00:01, 119MB/s]
+ 27%|##7 | 46.4M/170M [00:00<00:00, 151MB/s]
+ 43%|####3 | 73.6M/170M [00:00<00:00, 199MB/s]
+ 59%|#####8 | 99.9M/170M [00:00<00:00, 225MB/s]
+ 75%|#######4 | 127M/170M [00:00<00:00, 243MB/s]
+ 88%|########8 | 150M/170M [00:00<00:00, 230MB/s]
+100%|##########| 170M/170M [00:00<00:00, 196MB/s]
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
for i in range(dim)
/usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -514,7 +515,7 @@ torchvision rcnn models.</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Get 9 valid boxes
</pre></div>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 58.856 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 58.144 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-object-detection-pytorch-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_object_detection_pytorch.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized.html b/docs/how_to/deploy_models/deploy_prequantized.html
index 7b311961c..a94ba57dc 100644
--- a/docs/how_to/deploy_models/deploy_prequantized.html
+++ b/docs/how_to/deploy_models/deploy_prequantized.html
@@ -450,9 +450,10 @@ training. Other models require a full post training calibration.</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
0%| | 0.00/13.6M [00:00<?, ?B/s]
- 41%|#### | 5.55M/13.6M [00:00<00:00, 57.7MB/s]
- 82%|########1 | 11.1M/13.6M [00:00<00:00, 56.1MB/s]
-100%|##########| 13.6M/13.6M [00:00<00:00, 65.6MB/s]
+ 28%|##8 | 3.81M/13.6M [00:00<00:00, 39.7MB/s]
+ 56%|#####6 | 7.60M/13.6M [00:00<00:00, 36.1MB/s]
+ 86%|########6 | 11.7M/13.6M [00:00<00:00, 38.8MB/s]
+100%|##########| 13.6M/13.6M [00:00<00:00, 37.0MB/s]
</pre></div>
</div>
</div>
@@ -546,7 +547,7 @@ output values are identical out of 1000 outputs from mobilenet v2.</p>
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 90.3550 90.2908 93.6545 90.0487 0.3945
+ 90.3869 90.3271 93.3037 90.2228 0.3143
</pre></div>
</div>
<div class="admonition note">
@@ -585,7 +586,7 @@ This includes support for the VNNI 8 bit dot product instruction (CascadeLake or
<div class="section" id="deploy-a-quantized-tflite-model">
<h2>Deploy a quantized TFLite Model<a class="headerlink" href="#deploy-a-quantized-tflite-model" title="Permalink to this headline">¶</a></h2>
<p>TODO</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 8.392 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 8.345 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized_tflite.html b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
index b59c643e4..f4b4b88db 100644
--- a/docs/how_to/deploy_models/deploy_prequantized_tflite.html
+++ b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
@@ -545,7 +545,7 @@ TFLite Top-5 labels: [387 102 386 341 349]
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 119.9743 119.8743 123.1352 119.1519 0.5602
+ 121.2660 121.1909 123.7286 120.5713 0.4329
</pre></div>
</div>
<div class="admonition note">
@@ -573,7 +573,7 @@ network for ARM CPU</span></a>.</p></li>
</ul>
</div></blockquote>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 59.433 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 55.889 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-tflite-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized_tflite.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_quantized.html b/docs/how_to/deploy_models/deploy_quantized.html
index 9ddb0a986..7d541231f 100644
--- a/docs/how_to/deploy_models/deploy_quantized.html
+++ b/docs/how_to/deploy_models/deploy_quantized.html
@@ -482,7 +482,7 @@ for calibration. But the accuracy might be impacted.</p>
DeprecationWarning,
</pre></div>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 29.627 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 12.332 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-quantized-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_quantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
index 181fc2a68..f66b4af62 100644
--- a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
+++ b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
@@ -415,22 +415,22 @@ to your device.</p>
Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
0%| | 0/132723 [00:00<?, ?KB/s]
- 3%|3 | 4435/132723 [00:00<00:02, 43565.93KB/s]
- 10%|9 | 12736/132723 [00:00<00:01, 66595.18KB/s]
- 16%|#6 | 21534/132723 [00:00<00:01, 76318.71KB/s]
- 23%|##2 | 30396/132723 [00:00<00:01, 81161.08KB/s]
- 30%|##9 | 39264/132723 [00:00<00:01, 83864.43KB/s]
- 36%|###6 | 48180/132723 [00:00<00:00, 85660.24KB/s]
- 43%|####2 | 57015/132723 [00:00<00:00, 86537.00KB/s]
- 50%|####9 | 65862/132723 [00:00<00:00, 87150.29KB/s]
- 56%|#####6 | 74745/132723 [00:00<00:00, 87673.92KB/s]
- 63%|######2 | 83610/132723 [00:01<00:00, 87971.77KB/s]
- 70%|######9 | 92546/132723 [00:01<00:00, 88392.06KB/s]
- 76%|#######6 | 101465/132723 [00:01<00:00, 88632.85KB/s]
- 83%|########3 | 110353/132723 [00:01<00:00, 88706.07KB/s]
- 90%|########9 | 119224/132723 [00:01<00:00, 88578.13KB/s]
- 97%|#########6| 128083/132723 [00:01<00:00, 88200.81KB/s]
-100%|##########| 132723/132723 [00:01<00:00, 85184.13KB/s]
+ 5%|5 | 7131/132723 [00:00<00:01, 71302.59KB/s]
+ 12%|#1 | 15498/132723 [00:00<00:01, 78571.54KB/s]
+ 18%|#8 | 23936/132723 [00:00<00:01, 81217.31KB/s]
+ 24%|##4 | 32465/132723 [00:00<00:01, 82821.55KB/s]
+ 31%|### | 40991/132723 [00:00<00:01, 83698.67KB/s]
+ 37%|###7 | 49571/132723 [00:00<00:00, 84409.61KB/s]
+ 44%|####3 | 58058/132723 [00:00<00:00, 84557.24KB/s]
+ 50%|##### | 66578/132723 [00:00<00:00, 84760.14KB/s]
+ 57%|#####6 | 75055/132723 [00:00<00:00, 82763.80KB/s]
+ 63%|######2 | 83602/132723 [00:01<00:00, 83582.62KB/s]
+ 69%|######9 | 92214/132723 [00:01<00:00, 84348.45KB/s]
+ 76%|#######5 | 100781/132723 [00:01<00:00, 84743.24KB/s]
+ 82%|########2 | 109329/132723 [00:01<00:00, 84961.90KB/s]
+ 89%|########8 | 117995/132723 [00:01<00:00, 85469.97KB/s]
+ 95%|#########5| 126597/132723 [00:01<00:00, 85632.38KB/s]
+100%|##########| 132723/132723 [00:01<00:00, 83967.03KB/s]
</pre></div>
</div>
<p>Create TVM runtime and do inference
@@ -475,7 +475,7 @@ Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from h
</pre></div>
</div>
<img alt="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" />
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 22.655 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 21.806 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-ssd-gluoncv-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_ssd_gluoncv.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/sg_execution_times.html b/docs/how_to/deploy_models/sg_execution_times.html
index 5b01280b2..a6395ade4 100644
--- a/docs/how_to/deploy_models/sg_execution_times.html
+++ b/docs/how_to/deploy_models/sg_execution_times.html
@@ -300,16 +300,16 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-deploy-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>10:51.763</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
+<p><strong>10:28.339</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
<ul class="simple">
-<li><p><strong>02:58.856</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
-<li><p><strong>02:22.655</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
-<li><p><strong>01:59.433</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
-<li><p><strong>01:29.627</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
-<li><p><strong>01:08.392</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
-<li><p><strong>00:30.420</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
-<li><p><strong>00:22.175</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
-<li><p><strong>00:00.205</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
+<li><p><strong>02:58.144</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
+<li><p><strong>02:21.806</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
+<li><p><strong>01:55.889</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
+<li><p><strong>01:12.332</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
+<li><p><strong>01:08.345</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
+<li><p><strong>00:29.408</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
+<li><p><strong>00:22.206</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
+<li><p><strong>00:00.208</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/extend_tvm/bring_your_own_datatypes.html b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
index df92dc865..6d183824a 100644
--- a/docs/how_to/extend_tvm/bring_your_own_datatypes.html
+++ b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
@@ -590,7 +590,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip1beec7b5-11b5-4b0d-a47a-6a5a0f40d2cc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip5cf78752-2709-4777-b769-e323af70ff83 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
</pre></div>
</div>
<p>It’s easy to execute MobileNet with native TVM:</p>
diff --git a/docs/how_to/extend_tvm/sg_execution_times.html b/docs/how_to/extend_tvm/sg_execution_times.html
index c9c906c5b..512b50b14 100644
--- a/docs/how_to/extend_tvm/sg_execution_times.html
+++ b/docs/how_to/extend_tvm/sg_execution_times.html
@@ -300,11 +300,11 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-extend-tvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:41.304</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
+<p><strong>00:41.867</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
<ul class="simple">
-<li><p><strong>00:37.436</strong>: <a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></li>
-<li><p><strong>00:02.498</strong>: <a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></li>
-<li><p><strong>00:01.155</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
+<li><p><strong>00:38.032</strong>: <a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></li>
+<li><p><strong>00:02.477</strong>: <a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></li>
+<li><p><strong>00:01.144</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
<li><p><strong>00:00.215</strong>: <a class="reference internal" href="low_level_custom_pass.html#sphx-glr-how-to-extend-tvm-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/extend_tvm/use_pass_instrument.html b/docs/how_to/extend_tvm/use_pass_instrument.html
index 6600059b8..205b8a0c0 100644
--- a/docs/how_to/extend_tvm/use_pass_instrument.html
+++ b/docs/how_to/extend_tvm/use_pass_instrument.html
@@ -486,10 +486,10 @@ profile the execution time of each passes.</p>
</div>
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 7436us [7436us] (47.14%; 47.14%)
-FoldScaleAxis: 8338us [6us] (52.86%; 52.86%)
- FoldConstant: 8332us [1672us] (52.82%; 99.92%)
- InferType: 6660us [6660us] (42.22%; 79.93%)
+InferType: 6663us [6663us] (45.53%; 45.53%)
+FoldScaleAxis: 7972us [6us] (54.47%; 54.47%)
+ FoldConstant: 7967us [1604us] (54.44%; 99.93%)
+ InferType: 6363us [6363us] (43.48%; 79.87%)
</pre></div>
</div>
</div>
@@ -512,10 +512,10 @@ Refer to following sections and <a class="reference internal" href="../../refere
</div>
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 6691us [6691us] (44.81%; 44.81%)
-FoldScaleAxis: 8242us [6us] (55.19%; 55.19%)
- FoldConstant: 8236us [1687us] (55.15%; 99.93%)
- InferType: 6549us [6549us] (43.86%; 79.52%)
+InferType: 7142us [7142us] (47.08%; 47.08%)
+FoldScaleAxis: 8029us [6us] (52.92%; 52.92%)
+ FoldConstant: 8024us [1643us] (52.88%; 99.93%)
+ InferType: 6380us [6380us] (42.05%; 79.52%)
</pre></div>
</div>
<p>Register empty list to clear existing instruments.</p>
diff --git a/docs/how_to/optimize_operators/opt_conv_cuda.html b/docs/how_to/optimize_operators/opt_conv_cuda.html
index 193d7ecec..96d9ede89 100644
--- a/docs/how_to/optimize_operators/opt_conv_cuda.html
+++ b/docs/how_to/optimize_operators/opt_conv_cuda.html
@@ -534,7 +534,7 @@ latency of convolution.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 54.197352 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 54.211124 ms
</pre></div>
</div>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-optimize-operators-opt-conv-cuda-py">
diff --git a/docs/how_to/optimize_operators/opt_conv_tensorcore.html b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
index 0b52e3f4e..9ebb3b790 100644
--- a/docs/how_to/optimize_operators/opt_conv_tensorcore.html
+++ b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
@@ -878,7 +878,7 @@ be able to run on our build server</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 6.542161 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 6.877208 ms
</pre></div>
</div>
</div>
diff --git a/docs/how_to/optimize_operators/opt_gemm.html b/docs/how_to/optimize_operators/opt_gemm.html
index 90568d99e..d2ae772ea 100644
--- a/docs/how_to/optimize_operators/opt_gemm.html
+++ b/docs/how_to/optimize_operators/opt_gemm.html
@@ -431,8 +431,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.019223
-Baseline: 3.278164
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.019617
+Baseline: 3.480961
</pre></div>
</div>
<p>In TVM, we can always inspect lower level IR to debug or optimize our schedule.
@@ -494,7 +494,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.318309
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.311245
</pre></div>
</div>
<p>Here is the generated IR after blocking.</p>
@@ -563,7 +563,7 @@ vastly.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.343364
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.342973
</pre></div>
</div>
<p>Here is the generated IR after vectorization.</p>
@@ -626,7 +626,7 @@ the access pattern for A matrix is more cache friendly.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.125443
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.121032
</pre></div>
</div>
<p>Here is the generated IR after loop permutation.</p>
@@ -711,7 +711,7 @@ flattening.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.111377
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.111244
</pre></div>
</div>
<p>Here is the generated IR after array packing.</p>
@@ -799,7 +799,7 @@ write to C when all the block results are ready.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.112086
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.112666
</pre></div>
</div>
<p>Here is the generated IR after blocking.</p>
@@ -891,7 +891,7 @@ write to C when all the block results are ready.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.146824
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.146817
</pre></div>
</div>
<p>Here is the generated IR after parallelization.</p>
diff --git a/docs/how_to/optimize_operators/sg_execution_times.html b/docs/how_to/optimize_operators/sg_execution_times.html
index fdb4f7ba7..64047c6a9 100644
--- a/docs/how_to/optimize_operators/sg_execution_times.html
+++ b/docs/how_to/optimize_operators/sg_execution_times.html
@@ -300,11 +300,11 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-optimize-operators-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:35.260</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
+<p><strong>00:35.905</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
<ul class="simple">
-<li><p><strong>00:32.592</strong>: <a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></li>
-<li><p><strong>00:01.437</strong>: <a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></li>
-<li><p><strong>00:01.231</strong>: <a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></li>
+<li><p><strong>00:33.069</strong>: <a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></li>
+<li><p><strong>00:01.542</strong>: <a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></li>
+<li><p><strong>00:01.295</strong>: <a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
index a9f10d616..9e14f680c 100644
--- a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
+++ b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
@@ -300,14 +300,14 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-tune-with-autoscheduler-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:23.442</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
+<p><strong>05:21.893</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
<ul class="simple">
-<li><p><strong>02:32.857</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
-<li><p><strong>01:21.762</strong>: <a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></li>
-<li><p><strong>00:43.716</strong>: <a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></li>
-<li><p><strong>00:27.262</strong>: <a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></li>
-<li><p><strong>00:09.003</strong>: <a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></li>
-<li><p><strong>00:08.842</strong>: <a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></li>
+<li><p><strong>02:42.124</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
+<li><p><strong>01:21.321</strong>: <a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></li>
+<li><p><strong>00:43.525</strong>: <a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></li>
+<li><p><strong>00:17.397</strong>: <a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></li>
+<li><p><strong>00:08.861</strong>: <a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></li>
+<li><p><strong>00:08.665</strong>: <a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
index fd8951453..b84106349 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
@@ -470,483 +470,669 @@ cooperative fetching, unrolling and operator fusion.</p>
compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
- attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 28;
- allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
- allocate(pad_temp.shared: Pointer(shared float32), float32, [72]), storage_scope = shared;
- allocate(kernel.shared: Pointer(shared float32), float32, [3072]), storage_scope = shared;
- attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
- conv2d_nchw_1: Buffer(conv2d_nchw, float32, [14], [], scope="local", align=32)[0] = 0f32
+ attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 32;
+ allocate(conv2d_nchw: Pointer(local float32), float32, [2]), storage_scope = local;
+ allocate(pad_temp.shared: Pointer(shared float32), float32, [2016]), storage_scope = shared;
+ allocate(kernel.shared: Pointer(shared float32), float32, [1536]), storage_scope = shared;
+ attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392 {
+ conv2d_nchw_1: Buffer(conv2d_nchw, float32, [2], [], scope="local", align=8)[0] = 0f32
conv2d_nchw_1[1] = 0f32
- conv2d_nchw_1[2] = 0f32
- conv2d_nchw_1[3] = 0f32
- conv2d_nchw_1[4] = 0f32
- conv2d_nchw_1[5] = 0f32
- conv2d_nchw_1[6] = 0f32
- conv2d_nchw_1[7] = 0f32
- conv2d_nchw_1[8] = 0f32
- conv2d_nchw_1[9] = 0f32
- conv2d_nchw_1[10] = 0f32
- conv2d_nchw_1[11] = 0f32
- conv2d_nchw_1[12] = 0f32
- conv2d_nchw_1[13] = 0f32
- for (rc.outer.outer: int32, 0, 64) {
- for (ry.outer.outer: int32, 0, 3) {
- let cse_var_2: int32 = (rc.outer.outer*72)
- let cse_var_1: int32 = (ry.outer.outer*3)
- {
- attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1: Buffer(pad_temp.shared, float32, [72], [], scope="shared")[(threadIdx.x_1*4)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1*4), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + [...]
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 1), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0 [...]
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 2), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0 [...]
- }
- if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
- pad_temp.shared_1[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), data[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 3), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0 [...]
- }
- }
- attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1: Buffer(kernel.shared, float32, [3072], [], scope="shared")[threadIdx.x_2] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 64)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 8), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 128)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 16), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 192)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 36864)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 256)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 32), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 64), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 320)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 40), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 80), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 384)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 73728)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 56), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 112), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 512)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 64), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 128), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 576)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 110592)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 640)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 80), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 160), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 704)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 88), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 176), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 768)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 147456)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 832)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 104), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 208), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 896)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 112), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 224), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 960)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 184320)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1024)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 128), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 256), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1088)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 136), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 272), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1152)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 221184)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1216)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 152), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 304), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1280)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 160), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 320), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1344)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 258048)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1408)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 176), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 352), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1472)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 184), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 368), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1536)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 294912)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1600)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 200), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 400), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1664)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 208), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 416), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1728)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 331776)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1792)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 224), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 448), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1856)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 232), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 464), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1920)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 368640)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 1984)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 248), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 496), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2048)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 256), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 512), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2112)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 405504)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2176)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 272), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 544), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2240)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 280), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 560), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2304)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 442368)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2368)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 296), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 592), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2432)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 304), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 608), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2496)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 479232)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2560)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 320), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 640), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2624)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 328), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 656), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2688)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 516096)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2752)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 344), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 688), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2816)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 352), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 704), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2880)] = kernel[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 552960)]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 2944)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 368), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 736), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
- attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
- kernel.shared_1[(threadIdx.x_2 + 3008)] = kernel[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((floordiv(threadIdx.x_2, 8) + 376), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 752), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[0]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[9]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[1]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[2]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[3]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[4]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[5]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[6]*kernel.shared_1[(threadIdx.x*48)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 3)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[0]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[9]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 24)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 27)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 1)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 4)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[1]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[10]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 25)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 28)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[8]*kernel.shared_1[((threadIdx.x*48) + 2)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[17]*kernel.shared_1[((threadIdx.x*48) + 5)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[2]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[11]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[3]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[12]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[4]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[13]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[5]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[14]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[6]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[15]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[7]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[16]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[8]*kernel.shared_1[((threadIdx.x*48) + 26)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[17]*kernel.shared_1[((threadIdx.x*48) + 29)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[18]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[27]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 6)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 9)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[18]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[27]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 30)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 33)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 7)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 10)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[19]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[28]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 31)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 34)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[26]*kernel.shared_1[((threadIdx.x*48) + 8)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[35]*kernel.shared_1[((threadIdx.x*48) + 11)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[20]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[29]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[21]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[30]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[22]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[31]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[23]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[32]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[24]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[33]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[25]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[34]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[26]*kernel.shared_1[((threadIdx.x*48) + 32)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[35]*kernel.shared_1[((threadIdx.x*48) + 35)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[36]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[45]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 12)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 15)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[36]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[45]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 36)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 39)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 13)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 16)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[37]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[46]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 37)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 40)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[44]*kernel.shared_1[((threadIdx.x*48) + 14)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[53]*kernel.shared_1[((threadIdx.x*48) + 17)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[38]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[47]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[39]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[48]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[40]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[49]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[41]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[50]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[42]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[51]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[43]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[52]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[44]*kernel.shared_1[((threadIdx.x*48) + 38)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[53]*kernel.shared_1[((threadIdx.x*48) + 41)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[54]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[63]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 18)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 21)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[54]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[63]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 42)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 45)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 19)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 22)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[55]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[64]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 43)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 46)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[62]*kernel.shared_1[((threadIdx.x*48) + 20)]))
- conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[71]*kernel.shared_1[((threadIdx.x*48) + 23)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[56]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[65]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[57]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[66]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[58]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[67]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[59]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[68]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[60]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[69]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[61]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[70]*kernel.shared_1[((threadIdx.x*48) + 47)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[62]*kernel.shared_1[((threadIdx.x*48) + 44)]))
- conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[71]*kernel.shared_1[((threadIdx.x*48) + 47)]))
+ for (rc.outer.outer: int32, 0, 16) {
+ let cse_var_2: int32 = (rc.outer.outer*1568)
+ let cse_var_1: int32 = (rc.outer.outer*288)
+ {
+ attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1: Buffer(pad_temp.shared, float32, [2016], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)) && (1 <= floormod(threadIdx.x_1, 7))), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8) && (1 <= floormod(threadIdx.x_1, 7))), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
}
+ attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1: Buffer(kernel.shared, float32, [1536], [], scope="shared")[threadIdx.x_2] = kernel[((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3))]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[(((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3))]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[threadIdx.x_1] = @tir.if_then_else(((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else(((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 7)], 0f32, dtype=float32)
+ }
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + 1)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3)) + 1)]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[threadIdx.x_1] = @tir.if_then_else((((7 <= floormod(threadIdx.x_1, 63)) && (floormod(threadIdx.x_1, 63) < 56)) && (floormod(threadIdx.x_1, 7) < 6)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 63)*49)) + floormod(threadIdx.x_1, 63)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 2), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 2), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 56), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 2), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 112), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 168), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 224), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+ pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((floormod((floordiv(threadIdx.x_1, 7) + 1), 9) < 8) && (floormod(threadIdx.x_1, 7) < 6)), data[((((cse_var_2 + (floordiv((floordiv(threadIdx.x_1, 7) + 280), 9)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 1), 9)*7)) + floormod(threadIdx.x_1, 7)) - 6)], 0f32, dtype=float32)
+ }
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 96)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 96)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 49), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 8), 96), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 98), 12)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 16), 96), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + 2)]
+ attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+ if @tir.likely((threadIdx.x_2 < 360), dtype=bool) {
+ kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[((((((blockIdx.x*73728) + (floordiv((floordiv(threadIdx.x_2, 8) + 147), 12)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 32)*9)) + (floormod(threadIdx.x_2, 3)*3)) + 2)]
+ }
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[(floordiv(threadIdx.x, 49)*192)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[floormod(threadIdx.x, 49)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 96)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 1)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 7)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 97)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 2)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 14)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 98)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 3)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 63)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 99)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 4)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 70)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 100)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 5)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 77)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 101)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 6)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 126)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 102)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 7)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 133)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 103)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 8)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 140)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 104)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 9)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 189)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 105)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 10)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 196)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 106)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 11)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 203)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 107)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 12)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 108)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 13)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 259)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 109)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 14)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 266)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 110)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 15)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 315)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 111)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 16)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 322)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 112)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 17)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 329)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 113)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 18)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 378)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 114)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 19)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 385)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 115)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 20)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 392)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 116)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 21)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 441)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 117)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 22)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 448)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 118)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 23)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 455)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 119)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 24)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 504)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 120)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 25)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 511)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 121)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 26)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 518)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 122)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 27)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 567)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 123)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 28)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 574)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 124)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 29)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 581)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 125)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 30)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 630)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 126)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 31)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 637)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 127)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 32)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 644)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 128)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 33)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 693)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 129)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 34)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 700)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 130)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 35)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 707)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 131)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 36)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 756)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 132)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 37)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 763)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 133)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 38)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 770)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 134)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 39)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 819)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 135)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 40)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 826)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 136)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 41)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 833)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 137)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 42)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 882)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 138)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 43)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 889)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 139)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 44)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 896)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 140)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 45)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 945)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 141)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 46)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 952)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 142)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 47)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 959)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 143)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 48)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1008)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 144)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 49)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1015)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 145)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 50)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1022)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 146)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 51)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1071)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 147)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 52)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1078)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 148)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 53)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1085)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 149)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 54)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1134)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 150)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 55)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1141)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 151)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 56)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1148)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 152)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 57)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1197)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 153)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 58)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1204)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 154)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 59)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1211)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 155)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 60)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1260)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 156)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 61)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1267)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 157)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 62)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1274)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 158)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 63)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1323)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 159)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 64)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1330)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 160)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 65)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1337)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 161)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 66)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1386)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 162)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 67)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1393)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 163)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 68)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1400)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 164)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 69)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1449)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 165)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 70)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1456)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 166)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 71)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1463)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 167)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 72)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1512)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 168)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 73)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1519)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 169)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 74)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1526)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 170)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 75)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1575)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 171)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 76)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1582)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 172)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 77)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1589)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 173)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 78)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1638)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 174)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 79)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1645)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 175)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 80)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1652)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 176)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 81)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1701)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 177)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 82)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1708)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 178)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 83)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1715)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 179)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 84)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1764)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 180)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 85)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1771)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 181)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 86)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1778)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 182)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 87)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1827)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 183)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 88)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1834)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 184)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 89)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1841)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 185)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 90)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1890)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 186)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 91)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1897)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 187)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 92)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1904)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 188)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 93)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1953)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 189)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 94)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1960)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 190)]))
+ conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 95)]))
+ conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(floormod(threadIdx.x, 49) + 1967)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + 191)]))
}
}
for (i1.inner: int32, 0, 2) {
- for (i3.inner: int32, 0, 7) {
- compute[(((((floordiv(blockIdx.x, 7)*6272) + (threadIdx.x*98)) + (i1.inner*49)) + (floormod(blockIdx.x, 7)*7)) + i3.inner)] = max((conv2d_nchw_1[((i1.inner*7) + i3.inner)] + bias[(((floordiv(blockIdx.x, 7)*128) + (threadIdx.x*2)) + i1.inner)]), 0f32)
- }
+ compute[((((blockIdx.x*784) + (floordiv(threadIdx.x, 49)*98)) + (i1.inner*49)) + floormod(threadIdx.x, 49))] = max((conv2d_nchw_1[i1.inner] + bias[(((blockIdx.x*16) + (floordiv(threadIdx.x, 49)*2)) + i1.inner)]), 0f32)
}
}
}
@@ -984,7 +1170,7 @@ cooperative fetching, unrolling and operator fusion.</p>
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.362 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.274 ms
</pre></div>
</div>
</div>
@@ -1016,34 +1202,34 @@ conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_
conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
-conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=64)
+conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=8)
conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
-conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
+conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
-conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=7)
-conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
+conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
+conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
-conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
-conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=4)
+conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
+conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=32)
conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nc [...]
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
-compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=64)
+compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
-compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
+compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
-compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
-compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
+compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
+compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
@@ -1063,14 +1249,14 @@ s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
+kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
-pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
+pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
+pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512)
+s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
CUDA source code:
@@ -1088,430 +1274,640 @@ CUDA source code:
#define int64_t long long
#define uint64_t unsigned long long
#endif
-extern "C" __global__ void __launch_bounds__(64) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
- float conv2d_nchw[14];
- __shared__ float pad_temp_shared[72];
- __shared__ float kernel_shared[3072];
+extern "C" __global__ void __launch_bounds__(392) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+ float conv2d_nchw[2];
+ __shared__ float pad_temp_shared[2016];
+ __shared__ float kernel_shared[1536];
conv2d_nchw[0] = 0.000000e+00f;
conv2d_nchw[1] = 0.000000e+00f;
- conv2d_nchw[2] = 0.000000e+00f;
- conv2d_nchw[3] = 0.000000e+00f;
- conv2d_nchw[4] = 0.000000e+00f;
- conv2d_nchw[5] = 0.000000e+00f;
- conv2d_nchw[6] = 0.000000e+00f;
- conv2d_nchw[7] = 0.000000e+00f;
- conv2d_nchw[8] = 0.000000e+00f;
- conv2d_nchw[9] = 0.000000e+00f;
- conv2d_nchw[10] = 0.000000e+00f;
- conv2d_nchw[11] = 0.000000e+00f;
- conv2d_nchw[12] = 0.000000e+00f;
- conv2d_nchw[13] = 0.000000e+00f;
- for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) {
- for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
- __syncthreads();
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[(((int)threadIdx.x) * 4)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= ((((int)threadIdx.x) * 4) % 9))) && (((((int)threadIdx.x) * 4) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + (((((int)threadIdx.x) * 4) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + ((((int)threadIdx.x) * 4) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 1)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 1) % 9))) && ((((((int)threadIdx.x) * 4) + 1) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 1) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 1) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 2)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 2) % 9))) && ((((((int)threadIdx.x) * 4) + 2) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 2) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 2) % 9)) - 8)] : 0.000000e+00f);
- }
- if (((int)threadIdx.x) < 18) {
- pad_temp_shared[((((int)threadIdx.x) * 4) + 3)] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 3) % 9))) && ((((((int)threadIdx.x) * 4) + 3) % 9) < 8)) ? data[((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 3) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 3) % 9)) - 8)] : 0.000000e+00f);
- }
- kernel_shared[((int)threadIdx.x)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
- kernel_shared[(((int)threadIdx.x) + 64)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 64) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 128)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 128) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 192)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36864)];
- kernel_shared[(((int)threadIdx.x) + 256)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 256) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 320)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 320) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 384)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 73728)];
- kernel_shared[(((int)threadIdx.x) + 448)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 512)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 512) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 576)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 110592)];
- kernel_shared[(((int)threadIdx.x) + 640)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 640) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 704)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 704) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 768)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 147456)];
- kernel_shared[(((int)threadIdx.x) + 832)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 832) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 896)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 896) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 960)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 184320)];
- kernel_shared[(((int)threadIdx.x) + 1024)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1024) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1088)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1088) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1152)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 221184)];
- kernel_shared[(((int)threadIdx.x) + 1216)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1216) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1280)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1280) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1344)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 258048)];
- kernel_shared[(((int)threadIdx.x) + 1408)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1408) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1472)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1472) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1536)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 294912)];
- kernel_shared[(((int)threadIdx.x) + 1600)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1600) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1664)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1664) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1728)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 331776)];
- kernel_shared[(((int)threadIdx.x) + 1792)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1792) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1856)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1856) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 1920)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 368640)];
- kernel_shared[(((int)threadIdx.x) + 1984)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1984) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2048)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2048) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2112)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 405504)];
- kernel_shared[(((int)threadIdx.x) + 2176)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2176) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2240)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2240) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2304)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 442368)];
- kernel_shared[(((int)threadIdx.x) + 2368)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2368) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2432)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2432) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2496)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 479232)];
- kernel_shared[(((int)threadIdx.x) + 2560)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2624)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2624) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2688)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 516096)];
- kernel_shared[(((int)threadIdx.x) + 2752)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2752) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2816)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2816) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- kernel_shared[(((int)threadIdx.x) + 2880)] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 552960)];
- kernel_shared[(((int)threadIdx.x) + 2944)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2944) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
- kernel_shared[(((int)threadIdx.x) + 3008)] = kernel[(((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 3008) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
- __syncthreads();
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[0] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[9] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[1] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[2] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[3] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[4] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[5] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[6] * kernel_shared[(((int)threadIdx.x) * 48)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 3)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[0] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[9] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 24)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 27)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 1)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 4)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[1] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[10] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 25)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 28)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[8] * kernel_shared[((((int)threadIdx.x) * 48) + 2)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[17] * kernel_shared[((((int)threadIdx.x) * 48) + 5)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[2] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[11] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[3] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[12] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[4] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[13] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[5] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[14] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[6] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[15] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[7] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[16] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[8] * kernel_shared[((((int)threadIdx.x) * 48) + 26)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[17] * kernel_shared[((((int)threadIdx.x) * 48) + 29)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[18] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[27] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 6)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 9)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[18] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[27] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 30)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 33)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 7)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 10)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[19] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[28] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 31)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 34)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[26] * kernel_shared[((((int)threadIdx.x) * 48) + 8)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[35] * kernel_shared[((((int)threadIdx.x) * 48) + 11)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[20] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[29] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[21] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[30] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[22] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[31] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[23] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[32] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[24] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[33] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[25] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[34] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[26] * kernel_shared[((((int)threadIdx.x) * 48) + 32)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[35] * kernel_shared[((((int)threadIdx.x) * 48) + 35)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[36] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[45] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 12)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 15)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[36] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[45] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 36)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 39)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 13)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 16)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[37] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[46] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 37)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 40)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[44] * kernel_shared[((((int)threadIdx.x) * 48) + 14)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[53] * kernel_shared[((((int)threadIdx.x) * 48) + 17)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[38] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[47] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[39] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[48] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[40] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[49] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[41] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[50] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[42] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[51] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[43] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[52] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[44] * kernel_shared[((((int)threadIdx.x) * 48) + 38)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[53] * kernel_shared[((((int)threadIdx.x) * 48) + 41)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[54] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[63] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 18)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 21)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[54] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[63] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 42)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 45)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 19)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 22)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[55] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[64] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 43)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 46)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[62] * kernel_shared[((((int)threadIdx.x) * 48) + 20)]));
- conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[71] * kernel_shared[((((int)threadIdx.x) * 48) + 23)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[56] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[65] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[57] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[66] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[58] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[67] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[59] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[68] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[60] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[69] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[61] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[70] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[62] * kernel_shared[((((int)threadIdx.x) * 48) + 44)]));
- conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[71] * kernel_shared[((((int)threadIdx.x) * 48) + 47)]));
+ for (int rc_outer_outer = 0; rc_outer_outer < 16; ++rc_outer_outer) {
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = ((((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) && (1 <= (((int)threadIdx.x) % 7))) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = ((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = ((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = ((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = ((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((((int)threadIdx.x) < 49) && (1 <= (((int)threadIdx.x) % 7))) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+ }
+ kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3))];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3))];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3))];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[(((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3))];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = (((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = (((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = (((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = (((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = ((((int)threadIdx.x) < 49) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 7)] : 0.000000e+00f);
}
+ kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3)) + 1)];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3)) + 1)];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + 1)];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + 1)];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
+ __syncthreads();
+ pad_temp_shared[((int)threadIdx.x)] = ((((7 <= (((int)threadIdx.x) % 63)) && ((((int)threadIdx.x) % 63) < 56)) && ((((int)threadIdx.x) % 7) < 6)) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 63) * 49)) + (((int)threadIdx.x) % 63)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 392)] = ((((1 <= (((((int)threadIdx.x) / 7) + 2) % 9)) && ((((((int)threadIdx.x) / 7) + 2) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 2) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 784)] = ((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1176)] = ((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ pad_temp_shared[(((int)threadIdx.x) + 1568)] = ((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ if (((int)threadIdx.x) < 56) {
+ pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((((int)threadIdx.x) < 49) && ((((int)threadIdx.x) % 7) < 6)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 63) * 49)) + (((((int)threadIdx.x) / 7) + 1) * 7)) + (((int)threadIdx.x) % 7)) - 6)] : 0.000000e+00f);
+ }
+ kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((int)threadIdx.x) % 96) * 3)) + 2)];
+ kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 8) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3)) + 2)];
+ kernel_shared[(((int)threadIdx.x) + 784)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 784) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 16) % 96) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + 2)];
+ if (((int)threadIdx.x) < 360) {
+ kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 1176) / 96) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) / 3) + 8) & 31) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + 2)];
+ }
+ __syncthreads();
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[((((int)threadIdx.x) / 49) * 192)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((int)threadIdx.x) % 49)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 96)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 1)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 7)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 97)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 2)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 14)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 98)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 3)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 63)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 99)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 4)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 70)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 100)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 5)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 77)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 101)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 6)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 126)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 102)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 7)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 133)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 103)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 8)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 140)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 104)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 9)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 189)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 105)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 10)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 196)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 106)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 11)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 203)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 107)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 12)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 108)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 13)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 259)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 109)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 14)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 266)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 110)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 15)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 315)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 111)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 16)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 322)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 112)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 17)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 329)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 113)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 18)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 378)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 114)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 19)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 385)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 115)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 20)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 392)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 116)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 21)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 441)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 117)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 22)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 448)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 118)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 23)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 455)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 119)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 24)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 504)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 120)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 25)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 511)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 121)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 26)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 518)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 122)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 27)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 567)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 123)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 28)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 574)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 124)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 29)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 581)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 125)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 30)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 630)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 126)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 31)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 637)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 127)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 32)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 644)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 128)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 33)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 693)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 129)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 34)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 700)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 130)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 35)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 707)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 131)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 36)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 756)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 132)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 37)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 763)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 133)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 38)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 770)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 134)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 39)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 819)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 135)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 40)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 826)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 136)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 41)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 833)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 137)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 42)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 882)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 138)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 43)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 889)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 139)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 44)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 896)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 140)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 45)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 945)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 141)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 46)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 952)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 142)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 47)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 959)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 143)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 48)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1008)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 144)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 49)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1015)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 145)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 50)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1022)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 146)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 51)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1071)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 147)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 52)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1078)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 148)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 53)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1085)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 149)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 54)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1134)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 150)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 55)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1141)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 151)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 56)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1148)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 152)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 57)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1197)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 153)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 58)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1204)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 154)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 59)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1211)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 155)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 60)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1260)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 156)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 61)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1267)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 157)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 62)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1274)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 158)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 63)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1323)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 159)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 64)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1330)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 160)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 65)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1337)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 161)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 66)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1386)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 162)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 67)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1393)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 163)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 68)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1400)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 164)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 69)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1449)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 165)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 70)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1456)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 166)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 71)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1463)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 167)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 72)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1512)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 168)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 73)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1519)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 169)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 74)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1526)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 170)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 75)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1575)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 171)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 76)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1582)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 172)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 77)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1589)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 173)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 78)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1638)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 174)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 79)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1645)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 175)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 80)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1652)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 176)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 81)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1701)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 177)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 82)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1708)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 178)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 83)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1715)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 179)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 84)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1764)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 180)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 85)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1771)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 181)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 86)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1778)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 182)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 87)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1827)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 183)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 88)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1834)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 184)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 89)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1841)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 185)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 90)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1890)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 186)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 91)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1897)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 187)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 92)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1904)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 188)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 93)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1953)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 189)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 94)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1960)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 190)]));
+ conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 95)]));
+ conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((int)threadIdx.x) % 49) + 1967)] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + 191)]));
}
for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
- for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {
- compute[((((((((int)blockIdx.x) / 7) * 6272) + (((int)threadIdx.x) * 98)) + (i1_inner * 49)) + ((((int)blockIdx.x) % 7) * 7)) + i3_inner)] = max((conv2d_nchw[((i1_inner * 7) + i3_inner)] + bias[((((((int)blockIdx.x) / 7) * 128) + (((int)threadIdx.x) * 2)) + i1_inner)]), 0.000000e+00f);
- }
+ compute[((((((int)blockIdx.x) * 784) + ((((int)threadIdx.x) / 49) * 98)) + (i1_inner * 49)) + (((int)threadIdx.x) % 49))] = max((conv2d_nchw[i1_inner] + bias[(((((int)blockIdx.x) * 16) + ((((int)threadIdx.x) / 49) * 2)) + i1_inner)]), 0.000000e+00f);
}
}
</pre></div>
@@ -1549,7 +1945,7 @@ In the example below we resume the status and do more 5 trials.</p>
Get devices for measurement successfully!
</pre></div>
</div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 32.857 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes 42.124 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_conv2d_layer_cuda.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
index eed6cfe31..89433da4c 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
@@ -878,7 +878,7 @@ so we can read the log file and load the best schedules.</p>
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 9.9299 9.9418 9.9527 9.8950 0.0250
+ 9.5497 9.5434 9.5722 9.5336 0.0164
</pre></div>
</div>
</div>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
index fc156b66e..99d7b0a50 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
@@ -897,7 +897,7 @@ so we can read the log file and load the best schedules.</p>
Evaluate inference time cost...
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
- 759.8272 759.6222 760.7100 759.1494 0.6534
+ 760.3963 760.2333 761.3582 759.5974 0.7280
</pre></div>
</div>
</div>
@@ -919,7 +919,7 @@ to learn how to use the RPC Tracker and RPC Server.
To use the RPC Tracker in auto-scheduler, replace the runner in <code class="code docutils literal notranslate"><span class="pre">TuningOptions</span></code>
with <a class="reference internal" href="../../reference/api/python/auto_scheduler.html#tvm.auto_scheduler.RPCRunner" title="tvm.auto_scheduler.RPCRunner"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.RPCRunner</span></code></a>.</p></li>
</ol>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 21.762 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes 21.321 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-network-x86-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_network_x86.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
index afa5794ac..edcd6727d 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
@@ -600,30 +600,76 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
- preflattened_buffer_map = {placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_9: placeholder_16: Buffer(placeholder_14, float32, [128, 512], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_6: placeholder_17: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_5: placeholder_19: Buffer(placeholder_10, float32, [128, 256], [])} {
- for (i0.outer.i1.outer.fused: int32, 0, 128) "parallel" {
- allocate(compute_4: Pointer(global float32), float32, [512]), storage_scope = global {
- for (i.outer.inner: int32, 0, 2) {
- for (nb_j.inner: int32, 0, 2) {
- for (i.inner.init: int32, 0, 8) {
- for (j.init: int32, 0, 16) {
- compute_5: Buffer(compute_4, float32, [512], [])[((((i.outer.inner*256) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
- }
+ preflattened_buffer_map = {placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_17: Buffer(placeholder_14, float32, [128, 512], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_6: placeholder_19: Buffer(placeholder_11, float32, [4916, 16, 1], [])} {
+ for (i0.outer.i1.outer.fused: int32, 0, 64) "parallel" {
+ allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
+ for (nb_j.inner: int32, 0, 2) {
+ for (i.inner.init: int32, 0, 32) {
+ let cse_var_1: int32 = ((i.inner.init*32) + (nb_j.inner*16))
+ {
+ compute_5: Buffer(compute_4, float32, [1024], [])[cse_var_1] = 0f32
+ compute_5[(cse_var_1 + 1)] = 0f32
+ compute_5[(cse_var_1 + 2)] = 0f32
+ compute_5[(cse_var_1 + 3)] = 0f32
+ compute_5[(cse_var_1 + 4)] = 0f32
+ compute_5[(cse_var_1 + 5)] = 0f32
+ compute_5[(cse_var_1 + 6)] = 0f32
+ compute_5[(cse_var_1 + 7)] = 0f32
+ compute_5[(cse_var_1 + 8)] = 0f32
+ compute_5[(cse_var_1 + 9)] = 0f32
+ compute_5[(cse_var_1 + 10)] = 0f32
+ compute_5[(cse_var_1 + 11)] = 0f32
+ compute_5[(cse_var_1 + 12)] = 0f32
+ compute_5[(cse_var_1 + 13)] = 0f32
+ compute_5[(cse_var_1 + 14)] = 0f32
+ compute_5[(cse_var_1 + 15)] = 0f32
}
- for (elem_idx: int32, 0, let cse_var_1: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
- for (i.inner: int32, 0, 8) {
- for (j: int32, 0, 16) {
- let cse_var_3: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
- let cse_var_2: int32 = ((((i.outer.inner*256) + (i.inner*32)) + (nb_j.inner*16)) + j)
- compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i.outer.inner*2048)) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
- }
+ }
+ for (elem_idx: int32, 0, let cse_var_2: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
+ for (i.inner: int32, 0, 32) {
+ let cse_var_21: int32 = (elem_idx*16)
+ let cse_var_20: int32 = ((i.inner*32) + (nb_j.inner*16))
+ let cse_var_19: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
+ let cse_var_18: int32 = ((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i.inner*256))
+ let cse_var_17: int32 = (cse_var_20 + 9)
+ let cse_var_16: int32 = (cse_var_20 + 8)
+ let cse_var_15: int32 = (cse_var_20 + 7)
+ let cse_var_14: int32 = (cse_var_20 + 6)
+ let cse_var_13: int32 = (cse_var_20 + 5)
+ let cse_var_12: int32 = (cse_var_20 + 4)
+ let cse_var_11: int32 = (cse_var_20 + 3)
+ let cse_var_10: int32 = (cse_var_20 + 2)
+ let cse_var_9: int32 = (cse_var_20 + 15)
+ let cse_var_8: int32 = (cse_var_20 + 14)
+ let cse_var_7: int32 = (cse_var_20 + 13)
+ let cse_var_6: int32 = (cse_var_20 + 12)
+ let cse_var_5: int32 = (cse_var_20 + 11)
+ let cse_var_4: int32 = (cse_var_20 + 10)
+ let cse_var_3: int32 = (cse_var_20 + 1)
+ {
+ compute_5[cse_var_20] = (compute_5[cse_var_20] + (placeholder_1[((placeholder_3[cse_var_19]*16) + cse_var_21)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
+ compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_19]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_18 + placeholder_2[(placeholder_3[cse_var_19] + elem_idx)])], 0f32)))
}
}
}
}
- for (i0.inner: int32, 0, 16) {
- let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
- compute[ramp(cse_var_4, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
+ for (i0.inner: int32, 0, 32) {
+ let cse_var_22: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*16384) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
+ compute[ramp(cse_var_22, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_22, 1, 32)]), broadcast(0f32, 32))
}
}
}
@@ -662,7 +708,7 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
</pre></div>
</div>
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.571 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.743 ms
</pre></div>
</div>
<div class="admonition note">
diff --git a/docs/how_to/tune_with_autotvm/sg_execution_times.html b/docs/how_to/tune_with_autotvm/sg_execution_times.html
index 46b90a27f..c9675612f 100644
--- a/docs/how_to/tune_with_autotvm/sg_execution_times.html
+++ b/docs/how_to/tune_with_autotvm/sg_execution_times.html
@@ -300,13 +300,13 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-tune-with-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:44.102</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
+<p><strong>00:45.393</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
<ul class="simple">
-<li><p><strong>00:43.179</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
-<li><p><strong>00:00.244</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
-<li><p><strong>00:00.227</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a Convolutional Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
-<li><p><strong>00:00.227</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a Convolutional Network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
+<li><p><strong>00:44.476</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
+<li><p><strong>00:00.238</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
+<li><p><strong>00:00.229</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a Convolutional Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
<li><p><strong>00:00.226</strong>: <a class="reference internal" href="tune_relay_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a Convolutional Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></li>
+<li><p><strong>00:00.225</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a Convolutional Network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
index 54266ad52..404eb5ac0 100644
--- a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
+++ b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
@@ -1142,8 +1142,8 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 4, 4, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2885496
-No: 6 GFLOPS: 94.82/94.82 result: MeasureResult(costs=(0.0024414937291666666,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6543385982513428, timestamp=1654935570.08604) [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
-No: 7 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 6 GFLOPS: 110.87/110.87 result: MeasureResult(costs=(0.002088087645833333,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8702428340911865, timestamp=1654980075.9523664) [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
+No: 7 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1266,7 +1266,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 16, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6225319
-No: 8 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 8 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1389,7 +1389,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 8, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,943546
-No: 9 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 9 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1512,7 +1512,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 4, 16, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2868708
-No: 10 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 10 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
res = future.result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1530,7 +1530,7 @@ No: 10 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
TimeoutError
[('tile_f', [-1, 32, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4691833
-No: 11 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 11 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1653,7 +1653,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1042124
-No: 12 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 12 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1776,7 +1776,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 32, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10013405
-No: 13 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 13 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1899,7 +1899,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 8, 8, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6732082
-No: 14 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 14 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2022,7 +2022,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 4, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7536735
-No: 15 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 15 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2145,7 +2145,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,482121
-No: 16 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 16 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2268,7 +2268,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 2, 1, 16]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2824525
-No: 17 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 17 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2391,7 +2391,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4559286
-No: 18 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 18 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2514,7 +2514,7 @@ Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel [('tile_f', [-1, 1, 32, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9677544
-No: 19 GFLOPS: 0.00/94.82 result: Traceback (most recent call last):
+No: 19 GFLOPS: 0.00/110.87 result: Traceback (most recent call last):
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 721, in __call__
yield remote, remote.load_module(os.path.split(build_result.filename)[1])
File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 685, in run_through_rpc
@@ -2602,7 +2602,7 @@ tvm._ffi.base.TVMError: Traceback (most recent call last):
15: _PyEval_EvalFrameDefault
14: 0x0000000000537c30
13: _PyObject_FastCallKeywords
- 12: 0x00007f49fea61fa2
+ 12: 0x00007fc67a97dfa2
11: _ctypes_callproc
10: ffi_call
9: ffi_call_unix64
@@ -2667,7 +2667,7 @@ Traceback (most recent call last):
21: _PyFunction_FastCallKeywords
20: _PyEval_EvalFrameDefault
19: _PyFunction_FastCall [('tile_f', [-1, 8, 2, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6390073
-No: 20 GFLOPS: 144.91/144.91 result: MeasureResult(costs=(0.0015975758700000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4393105506896973, timestamp=1654935595.9084878) [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
+No: 20 GFLOPS: 144.26/144.26 result: MeasureResult(costs=(0.0016047824999999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4284491539001465, timestamp=1654980101.9109926) [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
</pre></div>
</div>
<p>Finally we can inspect the best config from log file, check correctness,
@@ -2706,7 +2706,7 @@ and measure running time.</p>
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Best config:
[('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
-Time cost of this operator: 0.002003
+Time cost of this operator: 0.001986
</pre></div>
</div>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autotvm-tune-conv2d-cuda-py">
diff --git a/docs/how_to/work_with_microtvm/micro_autotune.html b/docs/how_to/work_with_microtvm/micro_autotune.html
index d5b24be30..86d5d4325 100644
--- a/docs/how_to/work_with_microtvm/micro_autotune.html
+++ b/docs/how_to/work_with_microtvm/micro_autotune.html
@@ -556,10 +556,10 @@ the tuned operator.</p>
########## Build without Autotuning ##########
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 313.0 98.733 (1, 2, 10, 10, 3) 2 1
-tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 3.094 0.976 (1, 6, 10, 10) 1 1
-tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.923 0.291 (1, 1, 10, 10, 3) 1 1
-Total_time - 317.017 - - - -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 313.0 98.746 (1, 2, 10, 10, 3) 2 1
+tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 3.073 0.969 (1, 6, 10, 10) 1 1
+tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.904 0.285 (1, 1, 10, 10, 3) 1 1
+Total_time - 316.976 - - - -
</pre></div>
</div>
</div>
@@ -611,10 +611,10 @@ Total_time -
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>########## Build with Autotuning ##########
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 208.1 98.757 (1, 6, 10, 10, 1) 2 1
-tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 1.753 0.832 (1, 6, 10, 10) 1 1
-tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 0.866 0.411 (1, 3, 10, 10, 1) 1 1
-Total_time - 210.719 - - - -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc tvmgen_default_fused_nn_contrib_conv2d_NCHWc 192.7 98.39 (1, 1, 10, 10, 6) 2 1
+tvmgen_default_fused_layout_transform_1 tvmgen_default_fused_layout_transform_1 2.147 1.096 (1, 6, 10, 10) 1 1
+tvmgen_default_fused_layout_transform tvmgen_default_fused_layout_transform 1.007 0.514 (1, 3, 10, 10, 1) 1 1
+Total_time - 195.854 - - - -
</pre></div>
</div>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-autotune-py">
diff --git a/docs/how_to/work_with_microtvm/micro_train.html b/docs/how_to/work_with_microtvm/micro_train.html
index 9c2cb3438..3e200262f 100644
--- a/docs/how_to/work_with_microtvm/micro_train.html
+++ b/docs/how_to/work_with_microtvm/micro_train.html
@@ -552,8 +552,8 @@ objects to other stuff? We can display some examples from our datasets using <co
</div>
<img alt="../../_images/sphx_glr_micro_train_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_micro_train_001.png" />
<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmpgctt5uyy/images/target contains 8144 images
-/tmp/tmpgctt5uyy/images/random contains 5000 images
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmpaipbfjd6/images/target contains 8144 images
+/tmp/tmpaipbfjd6/images/random contains 5000 images
</pre></div>
</div>
</div>
@@ -666,11 +666,11 @@ the time on our validation set).</p>
</div>
<p class="sphx-glr-script-out">Out:</p>
<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Epoch 1/3
-328/328 - 54s - loss: 0.2112 - accuracy: 0.9264 - val_loss: 0.1324 - val_accuracy: 0.9569
+328/328 - 55s - loss: 0.2101 - accuracy: 0.9294 - val_loss: 0.1816 - val_accuracy: 0.9430
Epoch 2/3
-328/328 - 52s - loss: 0.1006 - accuracy: 0.9627 - val_loss: 0.1318 - val_accuracy: 0.9622
+328/328 - 52s - loss: 0.0969 - accuracy: 0.9635 - val_loss: 0.1619 - val_accuracy: 0.9543
Epoch 3/3
-328/328 - 52s - loss: 0.0694 - accuracy: 0.9735 - val_loss: 0.1206 - val_accuracy: 0.9596
+328/328 - 52s - loss: 0.0644 - accuracy: 0.9760 - val_loss: 0.1358 - val_accuracy: 0.9588
</pre></div>
</div>
</div>
@@ -959,7 +959,7 @@ as intended.</p>
<p>From here, we could modify the model to read live images from the camera - we have another
Arduino tutorial for how to do that <a class="reference external" href="https://github.com/guberti/tvm-arduino-demos/tree/master/examples/person_detection">on GitHub</a>. Alternatively, we could also
<a class="reference external" href="https://tvm.apache.org/docs/how_to/work_with_microtvm/micro_autotune.html">use TVM’s autotuning capabilities</a> to dramatically improve the model’s performance.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 5 minutes 34.618 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 4 minutes 28.596 seconds)</p>
<div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-train-py">
<div class="sphx-glr-download docutils container">
<p><a class="reference download internal" download="" href="../../_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">micro_train.py</span></code></a></p>
diff --git a/docs/how_to/work_with_microtvm/sg_execution_times.html b/docs/how_to/work_with_microtvm/sg_execution_times.html
index 20a052a9e..0c6d58c39 100644
--- a/docs/how_to/work_with_microtvm/sg_execution_times.html
+++ b/docs/how_to/work_with_microtvm/sg_execution_times.html
@@ -300,14 +300,14 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-work-with-microtvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>06:21.428</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
+<p><strong>05:16.413</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
<ul class="simple">
-<li><p><strong>05:34.618</strong>: <a class="reference internal" href="micro_train.html#sphx-glr-how-to-work-with-microtvm-micro-train-py"><span class="std std-ref">Training Vision Models for microTVM on Arduino</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_train.py</span></code>)</p></li>
-<li><p><strong>00:42.438</strong>: <a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></li>
-<li><p><strong>00:03.751</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
-<li><p><strong>00:00.209</strong>: <a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></li>
-<li><p><strong>00:00.207</strong>: <a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></li>
-<li><p><strong>00:00.206</strong>: <a class="reference internal" href="micro_reference_vm.html#sphx-glr-how-to-work-with-microtvm-micro-reference-vm-py"><span class="std std-ref">microTVM Reference Virtual Machines</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_reference_vm.py</span></code>)</p></li>
+<li><p><strong>04:28.596</strong>: <a class="reference internal" href="micro_train.html#sphx-glr-how-to-work-with-microtvm-micro-train-py"><span class="std std-ref">Training Vision Models for microTVM on Arduino</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_train.py</span></code>)</p></li>
+<li><p><strong>00:43.445</strong>: <a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></li>
+<li><p><strong>00:03.755</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
+<li><p><strong>00:00.208</strong>: <a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></li>
+<li><p><strong>00:00.205</strong>: <a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></li>
+<li><p><strong>00:00.203</strong>: <a class="reference internal" href="micro_reference_vm.html#sphx-glr-how-to-work-with-microtvm-micro-reference-vm-py"><span class="std std-ref">microTVM Reference Virtual Machines</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_reference_vm.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/work_with_relay/sg_execution_times.html b/docs/how_to/work_with_relay/sg_execution_times.html
index 749116d07..53eda3acd 100644
--- a/docs/how_to/work_with_relay/sg_execution_times.html
+++ b/docs/how_to/work_with_relay/sg_execution_times.html
@@ -300,11 +300,11 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-work-with-relay-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:06.380</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
+<p><strong>00:12.267</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
<ul class="simple">
-<li><p><strong>00:04.582</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
-<li><p><strong>00:01.572</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
-<li><p><strong>00:00.226</strong>: <a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></li>
+<li><p><strong>00:10.199</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
+<li><p><strong>00:01.840</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
+<li><p><strong>00:00.228</strong>: <a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></li>
</ul>
</div>
diff --git a/docs/how_to/work_with_schedules/sg_execution_times.html b/docs/how_to/work_with_schedules/sg_execution_times.html
index ebe7005bc..d374daf7c 100644
--- a/docs/how_to/work_with_schedules/sg_execution_times.html
+++ b/docs/how_to/work_with_schedules/sg_execution_times.html
@@ -300,14 +300,14 @@
<div class="section" id="computation-times">
<span id="sphx-glr-how-to-work-with-schedules-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:05.316</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
+<p><strong>00:06.103</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
<ul class="simple">
-<li><p><strong>00:01.996</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
-<li><p><strong>00:00.854</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
-<li><p><strong>00:00.720</strong>: <a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></li>
-<li><p><strong>00:00.698</strong>: <a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></li>
-<li><p><strong>00:00.320</strong>: <a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></li>
-<li><p><strong>00:00.254</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-how-to-work-with-schedules-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
+<li><p><strong>00:02.245</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
+<li><p><strong>00:01.263</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
+<li><p><strong>00:00.780</strong>: <a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></li>
+<li><p><strong>00:00.761</strong>: <a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></li>
+<li><p><strong>00:00.318</strong>: <a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></li>
+<li><p><strong>00:00.262</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-how-to-work-with-schedules-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
<li><p><strong>00:00.244</strong>: <a class="reference internal" href="tedd.html#sphx-glr-how-to-work-with-schedules-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></li>
<li><p><strong>00:00.231</strong>: <a class="reference internal" href="tuple_inputs.html#sphx-glr-how-to-work-with-schedules-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></li>
</ul>
diff --git a/docs/how_to/work_with_schedules/tensorize.html b/docs/how_to/work_with_schedules/tensorize.html
index 005743aec..8dd85fe0e 100644
--- a/docs/how_to/work_with_schedules/tensorize.html
+++ b/docs/how_to/work_with_schedules/tensorize.html
@@ -552,7 +552,7 @@ The importing needs to happen before the tensorized GEMV being executed.</p>
C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
buffer_map = {A_1: A, B_1: B, C_1: C}
preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
- attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmp14y5jrep/input0.cc'\nsource_filename = \"/tmp/tmp14y5jrep/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n %7 = allo [...]
+ attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpin6d2mbk/input0.cc'\nsource_filename = \"/tmp/tmpin6d2mbk/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n %7 = allo [...]
for (i, 0, 1024) {
for (j.outer: int32, 0, 32) {
@tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/reference/api/doxygen/classtvm_1_1BaseExpr__inherit__graph.svg b/docs/reference/api/doxygen/classtvm_1_1BaseExpr__inherit__graph.svg
index 8003d6d37..2af64d2f5 100644
--- a/docs/reference/api/doxygen/classtvm_1_1BaseExpr__inherit__graph.svg
+++ b/docs/reference/api/doxygen/classtvm_1_1BaseExpr__inherit__graph.svg
@@ -955,22 +955,24 @@
<!-- Node51 -->
<g id="node45" class="node">
<title>Node51</title>
-<g id="a_node45"><a xlink:href="classtvm_1_1relay_1_1Constant.html" target="_top" xlink:title="{tvm::relay::Constant\n||+ Constant()\l+ TVM_DEFINE_OBJECT_REF\l_METHODS()\l}">
-<polygon fill="#ffffff" stroke="#000000" points="6794,-22.5 6794,-101.5 6948,-101.5 6948,-22.5 6794,-22.5"/>
-<text text-anchor="middle" x="6871" y="-89.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
-<polyline fill="none" stroke="#000000" points="6794,-82.5 6948,-82.5 "/>
-<text text-anchor="middle" x="6871" y="-70.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="6794,-63.5 6948,-63.5 "/>
-<text text-anchor="start" x="6802" y="-51.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
-<text text-anchor="start" x="6802" y="-40.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="6802" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<g id="a_node45"><a xlink:href="classtvm_1_1relay_1_1Constant.html" target="_top" xlink:title="{tvm::relay::Constant\n||+ Constant()\l+ TVM_DEFINE_OBJECT_REF\l_METHODS()\l+ TVM_DEFINE_OBJECT_REF\l_COW_METHOD()\l}">
+<polygon fill="#ffffff" stroke="#000000" points="6794,-11.5 6794,-112.5 6948,-112.5 6948,-11.5 6794,-11.5"/>
+<text text-anchor="middle" x="6871" y="-100.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
+<polyline fill="none" stroke="#000000" points="6794,-93.5 6948,-93.5 "/>
+<text text-anchor="middle" x="6871" y="-81.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="6794,-74.5 6948,-74.5 "/>
+<text text-anchor="start" x="6802" y="-62.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<text text-anchor="start" x="6802" y="-51.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="6802" y="-40.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<text text-anchor="start" x="6802" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="6802" y="-18.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_COW_METHOD()</text>
</a>
</g>
</g>
<!-- Node43->Node51 -->
<g id="edge44" class="edge">
<title>Node43->Node51</title>
-<path fill="none" stroke="#191970" d="M6746.2605,-170.0592C6770.9089,-148.7069 6800.9245,-122.705 6825.3349,-101.5587"/>
+<path fill="none" stroke="#191970" d="M6746.1183,-170.1824C6766.6105,-152.4304 6790.8334,-131.4467 6812.4658,-112.707"/>
<polygon fill="none" stroke="#191970" points="6743.727,-167.6233 6738.4603,-176.8164 6748.3103,-172.9142 6743.727,-167.6233"/>
</g>
<!-- Node52 -->
diff --git a/docs/reference/api/doxygen/classtvm_1_1RelayExpr__inherit__graph.svg b/docs/reference/api/doxygen/classtvm_1_1RelayExpr__inherit__graph.svg
index 9c62e832f..52f719fe7 100644
--- a/docs/reference/api/doxygen/classtvm_1_1RelayExpr__inherit__graph.svg
+++ b/docs/reference/api/doxygen/classtvm_1_1RelayExpr__inherit__graph.svg
@@ -135,22 +135,24 @@
<!-- Node10 -->
<g id="node11" class="node">
<title>Node10</title>
-<g id="a_node11"><a xlink:href="classtvm_1_1relay_1_1Constant.html" target="_top" xlink:title="{tvm::relay::Constant\n||+ Constant()\l+ TVM_DEFINE_OBJECT_REF\l_METHODS()\l}">
-<polygon fill="#ffffff" stroke="#000000" points="884,-160.5 884,-239.5 1038,-239.5 1038,-160.5 884,-160.5"/>
-<text text-anchor="middle" x="961" y="-227.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
-<polyline fill="none" stroke="#000000" points="884,-220.5 1038,-220.5 "/>
-<text text-anchor="middle" x="961" y="-208.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="884,-201.5 1038,-201.5 "/>
-<text text-anchor="start" x="892" y="-189.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
-<text text-anchor="start" x="892" y="-178.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="892" y="-167.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<g id="a_node11"><a xlink:href="classtvm_1_1relay_1_1Constant.html" target="_top" xlink:title="{tvm::relay::Constant\n||+ Constant()\l+ TVM_DEFINE_OBJECT_REF\l_METHODS()\l+ TVM_DEFINE_OBJECT_REF\l_COW_METHOD()\l}">
+<polygon fill="#ffffff" stroke="#000000" points="884,-149.5 884,-250.5 1038,-250.5 1038,-149.5 884,-149.5"/>
+<text text-anchor="middle" x="961" y="-238.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
+<polyline fill="none" stroke="#000000" points="884,-231.5 1038,-231.5 "/>
+<text text-anchor="middle" x="961" y="-219.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="884,-212.5 1038,-212.5 "/>
+<text text-anchor="start" x="892" y="-200.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<text text-anchor="start" x="892" y="-189.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="892" y="-178.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<text text-anchor="start" x="892" y="-167.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="892" y="-156.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_COW_METHOD()</text>
</a>
</g>
</g>
<!-- Node0->Node10 -->
<g id="edge10" class="edge">
<title>Node0->Node10</title>
-<path fill="none" stroke="#191970" d="M1303.4064,-324.834C1231.9811,-316.196 1129.7436,-298.2968 1047,-262 1033.6841,-256.1587 1020.3968,-248.0184 1008.3885,-239.5541"/>
+<path fill="none" stroke="#191970" d="M1303.4064,-324.834C1231.9811,-316.196 1129.7436,-298.2968 1047,-262 1039.7521,-258.8206 1032.5127,-254.96 1025.4834,-250.7368"/>
<polygon fill="none" stroke="#191970" points="1303.3755,-328.3543 1313.715,-326.0424 1304.1905,-321.4019 1303.3755,-328.3543"/>
</g>
<!-- Node11 -->
diff --git a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant-members.html b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant-members.html
index 3add01011..b301d5f9e 100644
--- a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant-members.html
+++ b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant-members.html
@@ -87,11 +87,12 @@ $(function() {
<tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a4744bf4a1b48f202d41b51dc5e08e6ee">operator<</a>(const ObjectRef &other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#affdf1b8cdb36e140de7b3ad7064e4617">operator==</a>(const ObjectRef &other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#ae31a5b9f40781d60a2901994ead700e8">same_as</a>(const ObjectRef &other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
- <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html#a15da36fc6e073997d7363e1979eda258">TVM_DEFINE_OBJECT_REF_METHODS</a>(Constant, RelayExpr, ConstantNode)</td><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html">tvm::relay::Constant</a></td><td class="entry"></td></tr>
- <tr><td class="entry"><a class="el" href="classtvm_1_1RelayExpr.html#aaeb30fa197e3d25cd7897f3ac4dcef9d">tvm::RelayExpr::TVM_DEFINE_OBJECT_REF_METHODS</a>(RelayExpr, BaseExpr, RelayExprNode)</td><td class="entry"><a class="el" href="classtvm_1_1RelayExpr.html">tvm::RelayExpr</a></td><td class="entry"></td></tr>
- <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1BaseExpr.html#aa513c6abed6e5b76c7fc9441649b3e4c">tvm::BaseExpr::TVM_DEFINE_OBJECT_REF_METHODS</a>(BaseExpr, ObjectRef, BaseExprNode)</td><td class="entry"><a class="el" href="classtvm_1_1BaseExpr.html">tvm::BaseExpr</a></td><td class="entry"></td></tr>
- <tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a4e7cdb1574b93a59e784d70aa47b8da7">unique</a>() const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
- <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a0ae0da21d247cd87ea94fe3777c4405e">use_count</a>() const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
+ <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html#a0d1cb4aa284cd726e7efb9cacf061540">TVM_DEFINE_OBJECT_REF_COW_METHOD</a>(ConstantNode)</td><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html">tvm::relay::Constant</a></td><td class="entry"></td></tr>
+ <tr><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html#a15da36fc6e073997d7363e1979eda258">TVM_DEFINE_OBJECT_REF_METHODS</a>(Constant, RelayExpr, ConstantNode)</td><td class="entry"><a class="el" href="classtvm_1_1relay_1_1Constant.html">tvm::relay::Constant</a></td><td class="entry"></td></tr>
+ <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1RelayExpr.html#aaeb30fa197e3d25cd7897f3ac4dcef9d">tvm::RelayExpr::TVM_DEFINE_OBJECT_REF_METHODS</a>(RelayExpr, BaseExpr, RelayExprNode)</td><td class="entry"><a class="el" href="classtvm_1_1RelayExpr.html">tvm::RelayExpr</a></td><td class="entry"></td></tr>
+ <tr><td class="entry"><a class="el" href="classtvm_1_1BaseExpr.html#aa513c6abed6e5b76c7fc9441649b3e4c">tvm::BaseExpr::TVM_DEFINE_OBJECT_REF_METHODS</a>(BaseExpr, ObjectRef, BaseExprNode)</td><td class="entry"><a class="el" href="classtvm_1_1BaseExpr.html">tvm::BaseExpr</a></td><td class="entry"></td></tr>
+ <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a4e7cdb1574b93a59e784d70aa47b8da7">unique</a>() const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
+ <tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a0ae0da21d247cd87ea94fe3777c4405e">use_count</a>() const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
</table></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
diff --git a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant.html b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant.html
index 04f70ee00..5df3d74ff 100644
--- a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant.html
+++ b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant.html
@@ -74,13 +74,13 @@ $(function() {
<div class="dynheader">
Inheritance diagram for tvm::relay::Constant:</div>
<div class="dyncontent">
-<div class="center"><iframe scrolling="no" frameborder="0" src="classtvm_1_1relay_1_1Constant__inherit__graph.svg" width="216" height="758"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
+<div class="center"><iframe scrolling="no" frameborder="0" src="classtvm_1_1relay_1_1Constant__inherit__graph.svg" width="216" height="787"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
</div>
</div>
<div class="dynheader">
Collaboration diagram for tvm::relay::Constant:</div>
<div class="dyncontent">
-<div class="center"><iframe scrolling="no" frameborder="0" src="classtvm_1_1relay_1_1Constant__coll__graph.svg" width="216" height="1048"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
+<div class="center"><iframe scrolling="no" frameborder="0" src="classtvm_1_1relay_1_1Constant__coll__graph.svg" width="216" height="1078"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
</div>
</div>
<table class="memberdecls">
@@ -91,6 +91,8 @@ Public Member Functions</h2></td></tr>
<tr class="separator:acb2c9584fe3314ae16eab1670c554746"><td class="memSeparator" colspan="2"> </td></tr>
<tr class="memitem:a15da36fc6e073997d7363e1979eda258"><td class="memItemLeft" align="right" valign="top"> </td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1relay_1_1Constant.html#a15da36fc6e073997d7363e1979eda258">TVM_DEFINE_OBJECT_REF_METHODS</a> (<a class="el" href="classtvm_1_1relay_1_1Constant.html">Constant</a>, <a class="el" href="classtvm_1_1RelayExpr.html">RelayExpr</a>, <a class="el" href="classtvm_1_1relay_1_1ConstantNode.html">ConstantNode</a>) [...]
<tr class="separator:a15da36fc6e073997d7363e1979eda258"><td class="memSeparator" colspan="2"> </td></tr>
+<tr class="memitem:a0d1cb4aa284cd726e7efb9cacf061540"><td class="memItemLeft" align="right" valign="top"> </td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1relay_1_1Constant.html#a0d1cb4aa284cd726e7efb9cacf061540">TVM_DEFINE_OBJECT_REF_COW_METHOD</a> (<a class="el" href="classtvm_1_1relay_1_1ConstantNode.html">ConstantNode</a>)</td></tr>
+<tr class="separator:a0d1cb4aa284cd726e7efb9cacf061540"><td class="memSeparator" colspan="2"> </td></tr>
<tr class="inherit_header pub_methods_classtvm_1_1RelayExpr"><td colspan="2" onclick="javascript:toggleInherit('pub_methods_classtvm_1_1RelayExpr')"><img src="closed.png" alt="-"/> Public Member Functions inherited from <a class="el" href="classtvm_1_1RelayExpr.html">tvm::RelayExpr</a></td></tr>
<tr class="memitem:aaeb30fa197e3d25cd7897f3ac4dcef9d inherit pub_methods_classtvm_1_1RelayExpr"><td class="memItemLeft" align="right" valign="top"> </td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1RelayExpr.html#aaeb30fa197e3d25cd7897f3ac4dcef9d">TVM_DEFINE_OBJECT_REF_METHODS</a> (<a class="el" href="classtvm_1_1RelayExpr.html">RelayExpr</a>, <a class="el" href="classtvm_1_1BaseExpr.html">BaseExpr</a>, <a class="el" href="classtvm_1_1RelayExprNode.html"> [...]
<tr class="separator:aaeb30fa197e3d25cd7897f3ac4dcef9d inherit pub_methods_classtvm_1_1RelayExpr"><td class="memSeparator" colspan="2"> </td></tr>
@@ -207,6 +209,24 @@ Additional Inherited Members</h2></td></tr>
</div>
</div>
<h2 class="groupheader">Member Function Documentation</h2>
+<a id="a0d1cb4aa284cd726e7efb9cacf061540"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a0d1cb4aa284cd726e7efb9cacf061540">◆ </a></span>TVM_DEFINE_OBJECT_REF_COW_METHOD()</h2>
+
+<div class="memitem">
+<div class="memproto">
+ <table class="memname">
+ <tr>
+ <td class="memname">tvm::relay::Constant::TVM_DEFINE_OBJECT_REF_COW_METHOD </td>
+ <td>(</td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1relay_1_1ConstantNode.html">ConstantNode</a> </td>
+ <td class="paramname"></td><td>)</td>
+ <td></td>
+ </tr>
+ </table>
+</div><div class="memdoc">
+
+</div>
+</div>
<a id="a15da36fc6e073997d7363e1979eda258"></a>
<h2 class="memtitle"><span class="permalink"><a href="#a15da36fc6e073997d7363e1979eda258">◆ </a></span>TVM_DEFINE_OBJECT_REF_METHODS()</h2>
diff --git a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__coll__graph.svg b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__coll__graph.svg
index 4f970a48c..fc1f3909f 100644
--- a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__coll__graph.svg
+++ b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__coll__graph.svg
@@ -4,127 +4,129 @@
<!-- Generated by graphviz version 2.40.1 (20161225.0304)
-->
<!-- Title: tvm::relay::Constant Pages: 1 -->
-<svg width="162pt" height="786pt"
- viewBox="0.00 0.00 162.00 786.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
-<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 782)">
+<svg width="162pt" height="808pt"
+ viewBox="0.00 0.00 162.00 808.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 804)">
<title>tvm::relay::Constant</title>
-<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-782 158,-782 158,4 -4,4"/>
+<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-804 158,-804 158,4 -4,4"/>
<!-- Node4 -->
<g id="node1" class="node">
<title>Node4</title>
-<polygon fill="#bfbfbf" stroke="#000000" points="0,-.5 0,-79.5 154,-79.5 154,-.5 0,-.5"/>
-<text text-anchor="middle" x="77" y="-67.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
-<polyline fill="none" stroke="#000000" points="0,-60.5 154,-60.5 "/>
-<text text-anchor="middle" x="77" y="-48.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-41.5 154,-41.5 "/>
-<text text-anchor="start" x="8" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<polygon fill="#bfbfbf" stroke="#000000" points="0,-.5 0,-101.5 154,-101.5 154,-.5 0,-.5"/>
+<text text-anchor="middle" x="77" y="-89.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
+<polyline fill="none" stroke="#000000" points="0,-82.5 154,-82.5 "/>
+<text text-anchor="middle" x="77" y="-70.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-63.5 154,-63.5 "/>
+<text text-anchor="start" x="8" y="-51.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<text text-anchor="start" x="8" y="-40.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
<text text-anchor="start" x="8" y="-18.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-7.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<text text-anchor="start" x="8" y="-7.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_COW_METHOD()</text>
</g>
<!-- Node5 -->
<g id="node2" class="node">
<title>Node5</title>
<g id="a_node2"><a xlink:href="classtvm_1_1RelayExpr.html" target="_top" xlink:title="Managed reference to RelayExprNode. ">
-<polygon fill="#ffffff" stroke="#000000" points="0,-117.5 0,-185.5 154,-185.5 154,-117.5 0,-117.5"/>
-<text text-anchor="middle" x="77" y="-173.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">Expr</text>
-<polyline fill="none" stroke="#000000" points="0,-166.5 154,-166.5 "/>
-<text text-anchor="middle" x="77" y="-154.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-147.5 154,-147.5 "/>
-<text text-anchor="start" x="8" y="-135.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-124.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<polygon fill="#ffffff" stroke="#000000" points="0,-139.5 0,-207.5 154,-207.5 154,-139.5 0,-139.5"/>
+<text text-anchor="middle" x="77" y="-195.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">Expr</text>
+<polyline fill="none" stroke="#000000" points="0,-188.5 154,-188.5 "/>
+<text text-anchor="middle" x="77" y="-176.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-169.5 154,-169.5 "/>
+<text text-anchor="start" x="8" y="-157.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-146.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
</a>
</g>
</g>
<!-- Node5->Node4 -->
<g id="edge1" class="edge">
<title>Node5->Node4</title>
-<path fill="none" stroke="#191970" d="M77,-107.3732C77,-98.2436 77,-88.6788 77,-79.718"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-107.4404 77,-117.4405 80.5001,-107.4405 73.5001,-107.4404"/>
+<path fill="none" stroke="#191970" d="M77,-129.2437C77,-120.3671 77,-110.9733 77,-101.8923"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-129.3769 77,-139.3769 80.5001,-129.377 73.5001,-129.3769"/>
</g>
<!-- Node6 -->
<g id="node3" class="node">
<title>Node6</title>
<g id="a_node3"><a xlink:href="classtvm_1_1BaseExpr.html" target="_top" xlink:title="Managed reference to BaseExprNode. ">
-<polygon fill="#ffffff" stroke="#000000" points="0,-223.5 0,-291.5 154,-291.5 154,-223.5 0,-223.5"/>
-<text text-anchor="middle" x="77" y="-279.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::BaseExpr</text>
-<polyline fill="none" stroke="#000000" points="0,-272.5 154,-272.5 "/>
-<text text-anchor="middle" x="77" y="-260.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-253.5 154,-253.5 "/>
-<text text-anchor="start" x="8" y="-241.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-230.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<polygon fill="#ffffff" stroke="#000000" points="0,-245.5 0,-313.5 154,-313.5 154,-245.5 0,-245.5"/>
+<text text-anchor="middle" x="77" y="-301.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::BaseExpr</text>
+<polyline fill="none" stroke="#000000" points="0,-294.5 154,-294.5 "/>
+<text text-anchor="middle" x="77" y="-282.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-275.5 154,-275.5 "/>
+<text text-anchor="start" x="8" y="-263.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-252.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
</a>
</g>
</g>
<!-- Node6->Node5 -->
<g id="edge2" class="edge">
<title>Node6->Node5</title>
-<path fill="none" stroke="#191970" d="M77,-213.1711C77,-203.9405 77,-194.3493 77,-185.5586"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-213.3774 77,-223.3775 80.5001,-213.3775 73.5001,-213.3774"/>
+<path fill="none" stroke="#191970" d="M77,-235.1711C77,-225.9405 77,-216.3493 77,-207.5586"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-235.3774 77,-245.3775 80.5001,-235.3775 73.5001,-235.3774"/>
</g>
<!-- Node7 -->
<g id="node4" class="node">
<title>Node7</title>
<g id="a_node4"><a xlink:href="classtvm_1_1runtime_1_1ObjectRef.html" target="_top" xlink:title="Base class of all object reference. ">
-<polygon fill="#ffffff" stroke="#000000" points="10,-329.5 10,-551.5 144,-551.5 144,-329.5 10,-329.5"/>
-<text text-anchor="middle" x="77" y="-539.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectRef</text>
-<polyline fill="none" stroke="#000000" points="10,-532.5 144,-532.5 "/>
-<text text-anchor="start" x="18" y="-520.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ _type_is_nullable</text>
-<polyline fill="none" stroke="#000000" points="10,-513.5 144,-513.5 "/>
-<text text-anchor="start" x="18" y="-501.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
-<text text-anchor="start" x="18" y="-490.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
-<text text-anchor="start" x="18" y="-479.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ same_as()</text>
-<text text-anchor="start" x="18" y="-468.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator==()</text>
-<text text-anchor="start" x="18" y="-457.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator!=()</text>
-<text text-anchor="start" x="18" y="-446.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator<()</text>
-<text text-anchor="start" x="18" y="-435.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ defined()</text>
-<text text-anchor="start" x="18" y="-424.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
-<text text-anchor="start" x="18" y="-413.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
-<text text-anchor="start" x="18" y="-402.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ unique()</text>
-<text text-anchor="start" x="18" y="-391.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ use_count()</text>
-<text text-anchor="start" x="18" y="-380.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ as()</text>
-<text text-anchor="start" x="18" y="-369.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># get_mutable()</text>
-<text text-anchor="start" x="18" y="-358.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># DowncastNoCheck()</text>
-<text text-anchor="start" x="18" y="-347.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># FFIClearAfterMove()</text>
-<text text-anchor="start" x="18" y="-336.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># GetDataPtr()</text>
+<polygon fill="#ffffff" stroke="#000000" points="10,-351.5 10,-573.5 144,-573.5 144,-351.5 10,-351.5"/>
+<text text-anchor="middle" x="77" y="-561.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectRef</text>
+<polyline fill="none" stroke="#000000" points="10,-554.5 144,-554.5 "/>
+<text text-anchor="start" x="18" y="-542.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ _type_is_nullable</text>
+<polyline fill="none" stroke="#000000" points="10,-535.5 144,-535.5 "/>
+<text text-anchor="start" x="18" y="-523.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
+<text text-anchor="start" x="18" y="-512.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
+<text text-anchor="start" x="18" y="-501.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ same_as()</text>
+<text text-anchor="start" x="18" y="-490.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator==()</text>
+<text text-anchor="start" x="18" y="-479.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator!=()</text>
+<text text-anchor="start" x="18" y="-468.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator<()</text>
+<text text-anchor="start" x="18" y="-457.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ defined()</text>
+<text text-anchor="start" x="18" y="-446.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
+<text text-anchor="start" x="18" y="-435.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
+<text text-anchor="start" x="18" y="-424.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ unique()</text>
+<text text-anchor="start" x="18" y="-413.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ use_count()</text>
+<text text-anchor="start" x="18" y="-402.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ as()</text>
+<text text-anchor="start" x="18" y="-391.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># get_mutable()</text>
+<text text-anchor="start" x="18" y="-380.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># DowncastNoCheck()</text>
+<text text-anchor="start" x="18" y="-369.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># FFIClearAfterMove()</text>
+<text text-anchor="start" x="18" y="-358.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># GetDataPtr()</text>
</a>
</g>
</g>
<!-- Node7->Node6 -->
<g id="edge3" class="edge">
<title>Node7->Node6</title>
-<path fill="none" stroke="#191970" d="M77,-319.2198C77,-309.2931 77,-299.8914 77,-291.5572"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-319.3012 77,-329.3012 80.5001,-319.3012 73.5001,-319.3012"/>
+<path fill="none" stroke="#191970" d="M77,-341.2198C77,-331.2931 77,-321.8914 77,-313.5572"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-341.3012 77,-351.3012 80.5001,-341.3012 73.5001,-341.3012"/>
</g>
<!-- Node8 -->
<g id="node5" class="node">
<title>Node8</title>
<g id="a_node5"><a xlink:href="classtvm_1_1runtime_1_1ObjectPtr.html" target="_top" xlink:title="{tvm::runtime::ObjectPtr\l\< tvm::runtime::Object \>\n||+ ObjectPtr()\l+ ObjectPtr()\l+ ObjectPtr()\l+ ObjectPtr()\l+ ObjectPtr()\l+ ObjectPtr()\l+ ~ObjectPtr()\l+ swap()\l+ get()\l+ operator-\>()\land 11 more...\l}">
-<polygon fill="#ffffff" stroke="#000000" points="7,-599.5 7,-777.5 147,-777.5 147,-599.5 7,-599.5"/>
-<text text-anchor="start" x="15" y="-765.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectPtr</text>
-<text text-anchor="middle" x="77" y="-754.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">< tvm::runtime::Object ></text>
-<polyline fill="none" stroke="#000000" points="7,-747.5 147,-747.5 "/>
-<text text-anchor="middle" x="77" y="-735.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="7,-728.5 147,-728.5 "/>
+<polygon fill="#ffffff" stroke="#000000" points="7,-621.5 7,-799.5 147,-799.5 147,-621.5 7,-621.5"/>
+<text text-anchor="start" x="15" y="-787.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectPtr</text>
+<text text-anchor="middle" x="77" y="-776.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">< tvm::runtime::Object ></text>
+<polyline fill="none" stroke="#000000" points="7,-769.5 147,-769.5 "/>
+<text text-anchor="middle" x="77" y="-757.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="7,-750.5 147,-750.5 "/>
+<text text-anchor="start" x="15" y="-738.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
+<text text-anchor="start" x="15" y="-727.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
<text text-anchor="start" x="15" y="-716.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
<text text-anchor="start" x="15" y="-705.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
<text text-anchor="start" x="15" y="-694.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
<text text-anchor="start" x="15" y="-683.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
-<text text-anchor="start" x="15" y="-672.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
-<text text-anchor="start" x="15" y="-661.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectPtr()</text>
-<text text-anchor="start" x="15" y="-650.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ~ObjectPtr()</text>
-<text text-anchor="start" x="15" y="-639.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ swap()</text>
-<text text-anchor="start" x="15" y="-628.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
-<text text-anchor="start" x="15" y="-617.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
-<text text-anchor="start" x="15" y="-606.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">and 11 more...</text>
+<text text-anchor="start" x="15" y="-672.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ~ObjectPtr()</text>
+<text text-anchor="start" x="15" y="-661.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ swap()</text>
+<text text-anchor="start" x="15" y="-650.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
+<text text-anchor="start" x="15" y="-639.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
+<text text-anchor="start" x="15" y="-628.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">and 11 more...</text>
</a>
</g>
</g>
<!-- Node8->Node7 -->
<g id="edge4" class="edge">
<title>Node8->Node7</title>
-<path fill="none" stroke="#404040" d="M77,-599.3167C77,-587.8765 77,-576.0062 77,-564.1402"/>
-<polygon fill="none" stroke="#404040" points="77.0001,-563.7944 73,-557.7944 77,-551.7944 81,-557.7943 77.0001,-563.7944"/>
-<text text-anchor="middle" x="96.5" y="-573" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> #data_</text>
+<path fill="none" stroke="#404040" d="M77,-621.3167C77,-609.8765 77,-598.0062 77,-586.1402"/>
+<polygon fill="none" stroke="#404040" points="77.0001,-585.7944 73,-579.7944 77,-573.7944 81,-579.7943 77.0001,-585.7944"/>
+<text text-anchor="middle" x="96.5" y="-595" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> #data_</text>
</g>
</g>
</svg>
diff --git a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__inherit__graph.svg b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__inherit__graph.svg
index 06a4893b2..4c4c6f8d7 100644
--- a/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__inherit__graph.svg
+++ b/docs/reference/api/doxygen/classtvm_1_1relay_1_1Constant__inherit__graph.svg
@@ -4,97 +4,99 @@
<!-- Generated by graphviz version 2.40.1 (20161225.0304)
-->
<!-- Title: tvm::relay::Constant Pages: 1 -->
-<svg width="162pt" height="568pt"
- viewBox="0.00 0.00 162.00 568.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
-<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 564)">
+<svg width="162pt" height="590pt"
+ viewBox="0.00 0.00 162.00 590.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 586)">
<title>tvm::relay::Constant</title>
-<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-564 158,-564 158,4 -4,4"/>
+<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-586 158,-586 158,4 -4,4"/>
<!-- Node0 -->
<g id="node1" class="node">
<title>Node0</title>
-<polygon fill="#bfbfbf" stroke="#000000" points="0,-.5 0,-79.5 154,-79.5 154,-.5 0,-.5"/>
-<text text-anchor="middle" x="77" y="-67.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
-<polyline fill="none" stroke="#000000" points="0,-60.5 154,-60.5 "/>
-<text text-anchor="middle" x="77" y="-48.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-41.5 154,-41.5 "/>
-<text text-anchor="start" x="8" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<polygon fill="#bfbfbf" stroke="#000000" points="0,-.5 0,-101.5 154,-101.5 154,-.5 0,-.5"/>
+<text text-anchor="middle" x="77" y="-89.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::relay::Constant</text>
+<polyline fill="none" stroke="#000000" points="0,-82.5 154,-82.5 "/>
+<text text-anchor="middle" x="77" y="-70.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-63.5 154,-63.5 "/>
+<text text-anchor="start" x="8" y="-51.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ Constant()</text>
+<text text-anchor="start" x="8" y="-40.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-29.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
<text text-anchor="start" x="8" y="-18.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-7.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<text text-anchor="start" x="8" y="-7.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_COW_METHOD()</text>
</g>
<!-- Node1 -->
<g id="node2" class="node">
<title>Node1</title>
<g id="a_node2"><a xlink:href="classtvm_1_1RelayExpr.html" target="_top" xlink:title="Managed reference to RelayExprNode. ">
-<polygon fill="#ffffff" stroke="#000000" points="0,-116.5 0,-184.5 154,-184.5 154,-116.5 0,-116.5"/>
-<text text-anchor="middle" x="77" y="-172.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">Expr</text>
-<polyline fill="none" stroke="#000000" points="0,-165.5 154,-165.5 "/>
-<text text-anchor="middle" x="77" y="-153.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-146.5 154,-146.5 "/>
-<text text-anchor="start" x="8" y="-134.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-123.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<polygon fill="#ffffff" stroke="#000000" points="0,-138.5 0,-206.5 154,-206.5 154,-138.5 0,-138.5"/>
+<text text-anchor="middle" x="77" y="-194.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">Expr</text>
+<polyline fill="none" stroke="#000000" points="0,-187.5 154,-187.5 "/>
+<text text-anchor="middle" x="77" y="-175.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-168.5 154,-168.5 "/>
+<text text-anchor="start" x="8" y="-156.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-145.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
</a>
</g>
</g>
<!-- Node1->Node0 -->
<g id="edge1" class="edge">
<title>Node1->Node0</title>
-<path fill="none" stroke="#191970" d="M77,-106.3044C77,-97.4634 77,-88.2287 77,-79.5566"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-106.4447 77,-116.4447 80.5001,-106.4447 73.5001,-106.4447"/>
+<path fill="none" stroke="#191970" d="M77,-128.1349C77,-119.5408 77,-110.474 77,-101.6952"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-128.3321 77,-138.3321 80.5001,-128.3322 73.5001,-128.3321"/>
</g>
<!-- Node2 -->
<g id="node3" class="node">
<title>Node2</title>
<g id="a_node3"><a xlink:href="classtvm_1_1BaseExpr.html" target="_top" xlink:title="Managed reference to BaseExprNode. ">
-<polygon fill="#ffffff" stroke="#000000" points="0,-221.5 0,-289.5 154,-289.5 154,-221.5 0,-221.5"/>
-<text text-anchor="middle" x="77" y="-277.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::BaseExpr</text>
-<polyline fill="none" stroke="#000000" points="0,-270.5 154,-270.5 "/>
-<text text-anchor="middle" x="77" y="-258.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
-<polyline fill="none" stroke="#000000" points="0,-251.5 154,-251.5 "/>
-<text text-anchor="start" x="8" y="-239.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
-<text text-anchor="start" x="8" y="-228.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
+<polygon fill="#ffffff" stroke="#000000" points="0,-243.5 0,-311.5 154,-311.5 154,-243.5 0,-243.5"/>
+<text text-anchor="middle" x="77" y="-299.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::BaseExpr</text>
+<polyline fill="none" stroke="#000000" points="0,-292.5 154,-292.5 "/>
+<text text-anchor="middle" x="77" y="-280.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"> </text>
+<polyline fill="none" stroke="#000000" points="0,-273.5 154,-273.5 "/>
+<text text-anchor="start" x="8" y="-261.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ TVM_DEFINE_OBJECT_REF</text>
+<text text-anchor="start" x="8" y="-250.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">_METHODS()</text>
</a>
</g>
</g>
<!-- Node2->Node1 -->
<g id="edge2" class="edge">
<title>Node2->Node1</title>
-<path fill="none" stroke="#191970" d="M77,-211.2353C77,-202.3902 77,-193.2253 77,-184.7788"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-211.4095 77,-221.4095 80.5001,-211.4095 73.5001,-211.4095"/>
+<path fill="none" stroke="#191970" d="M77,-233.2353C77,-224.3902 77,-215.2253 77,-206.7788"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-233.4095 77,-243.4095 80.5001,-233.4095 73.5001,-233.4095"/>
</g>
<!-- Node3 -->
<g id="node4" class="node">
<title>Node3</title>
<g id="a_node4"><a xlink:href="classtvm_1_1runtime_1_1ObjectRef.html" target="_top" xlink:title="Base class of all object reference. ">
-<polygon fill="#ffffff" stroke="#000000" points="10,-326.5 10,-559.5 144,-559.5 144,-326.5 10,-326.5"/>
-<text text-anchor="middle" x="77" y="-547.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectRef</text>
-<polyline fill="none" stroke="#000000" points="10,-540.5 144,-540.5 "/>
-<text text-anchor="start" x="18" y="-528.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ _type_is_nullable</text>
-<text text-anchor="start" x="18" y="-517.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># data_</text>
-<polyline fill="none" stroke="#000000" points="10,-510.5 144,-510.5 "/>
-<text text-anchor="start" x="18" y="-498.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
-<text text-anchor="start" x="18" y="-487.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
-<text text-anchor="start" x="18" y="-476.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ same_as()</text>
-<text text-anchor="start" x="18" y="-465.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator==()</text>
-<text text-anchor="start" x="18" y="-454.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator!=()</text>
-<text text-anchor="start" x="18" y="-443.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator<()</text>
-<text text-anchor="start" x="18" y="-432.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ defined()</text>
-<text text-anchor="start" x="18" y="-421.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
-<text text-anchor="start" x="18" y="-410.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
-<text text-anchor="start" x="18" y="-399.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ unique()</text>
-<text text-anchor="start" x="18" y="-388.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ use_count()</text>
-<text text-anchor="start" x="18" y="-377.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ as()</text>
-<text text-anchor="start" x="18" y="-366.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># get_mutable()</text>
-<text text-anchor="start" x="18" y="-355.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># DowncastNoCheck()</text>
-<text text-anchor="start" x="18" y="-344.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># FFIClearAfterMove()</text>
-<text text-anchor="start" x="18" y="-333.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># GetDataPtr()</text>
+<polygon fill="#ffffff" stroke="#000000" points="10,-348.5 10,-581.5 144,-581.5 144,-348.5 10,-348.5"/>
+<text text-anchor="middle" x="77" y="-569.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">tvm::runtime::ObjectRef</text>
+<polyline fill="none" stroke="#000000" points="10,-562.5 144,-562.5 "/>
+<text text-anchor="start" x="18" y="-550.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ _type_is_nullable</text>
+<text text-anchor="start" x="18" y="-539.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># data_</text>
+<polyline fill="none" stroke="#000000" points="10,-532.5 144,-532.5 "/>
+<text text-anchor="start" x="18" y="-520.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
+<text text-anchor="start" x="18" y="-509.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ ObjectRef()</text>
+<text text-anchor="start" x="18" y="-498.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ same_as()</text>
+<text text-anchor="start" x="18" y="-487.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator==()</text>
+<text text-anchor="start" x="18" y="-476.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator!=()</text>
+<text text-anchor="start" x="18" y="-465.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator<()</text>
+<text text-anchor="start" x="18" y="-454.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ defined()</text>
+<text text-anchor="start" x="18" y="-443.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ get()</text>
+<text text-anchor="start" x="18" y="-432.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ operator->()</text>
+<text text-anchor="start" x="18" y="-421.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ unique()</text>
+<text text-anchor="start" x="18" y="-410.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ use_count()</text>
+<text text-anchor="start" x="18" y="-399.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000">+ as()</text>
+<text text-anchor="start" x="18" y="-388.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># get_mutable()</text>
+<text text-anchor="start" x="18" y="-377.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># DowncastNoCheck()</text>
+<text text-anchor="start" x="18" y="-366.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># FFIClearAfterMove()</text>
+<text text-anchor="start" x="18" y="-355.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="#000000"># GetDataPtr()</text>
</a>
</g>
</g>
<!-- Node3->Node2 -->
<g id="edge3" class="edge">
<title>Node3->Node2</title>
-<path fill="none" stroke="#191970" d="M77,-316.2822C77,-306.749 77,-297.737 77,-289.7166"/>
-<polygon fill="none" stroke="#191970" points="73.5001,-316.3857 77,-326.3857 80.5001,-316.3857 73.5001,-316.3857"/>
+<path fill="none" stroke="#191970" d="M77,-338.2822C77,-328.749 77,-319.737 77,-311.7166"/>
+<polygon fill="none" stroke="#191970" points="73.5001,-338.3857 77,-348.3857 80.5001,-338.3857 73.5001,-338.3857"/>
</g>
</g>
</svg>
diff --git a/docs/reference/api/doxygen/functions_func_t.html b/docs/reference/api/doxygen/functions_func_t.html
index 46327240b..b28955986 100644
--- a/docs/reference/api/doxygen/functions_func_t.html
+++ b/docs/reference/api/doxygen/functions_func_t.html
@@ -763,6 +763,7 @@ $(function() {
, <a class="el" href="classtvm_1_1IRModule.html#ac3d7b217437ecefbd9096a57325ae29a">tvm::IRModule</a>
, <a class="el" href="classtvm_1_1relay_1_1Call.html#ab9eee004a05e13a319c9f1db05602754">tvm::relay::Call</a>
, <a class="el" href="classtvm_1_1relay_1_1Clause.html#a53074960bfc52dd8fbccfd543758f005">tvm::relay::Clause</a>
+, <a class="el" href="classtvm_1_1relay_1_1Constant.html#a0d1cb4aa284cd726e7efb9cacf061540">tvm::relay::Constant</a>
, <a class="el" href="classtvm_1_1relay_1_1Function.html#ac085d821f02ee1e2a4927f85f72f6862">tvm::relay::Function</a>
, <a class="el" href="classtvm_1_1relay_1_1FunctionPattern.html#ac9755e2e62689ab47e889db57433e338">tvm::relay::FunctionPattern</a>
, <a class="el" href="classtvm_1_1relay_1_1If.html#a75f8258495a590eeaa4dc161b650d2f7">tvm::relay::If</a>
@@ -1005,7 +1006,7 @@ $(function() {
: <a class="el" href="classtvm_1_1runtime_1_1TVMPODValue__.html#a2f46b59a6c1d5eb4575d7f583b5f1a0c">tvm::runtime::TVMPODValue_</a>
</li>
<li>TVMRetValue()
-: <a class="el" href="classtvm_1_1runtime_1_1TVMRetValue.html#ac4a3850c0989e7c2d5cd8e0f096d0997">tvm::runtime::TVMRetValue</a>
+: <a class="el" href="classtvm_1_1runtime_1_1TVMRetValue.html#a77455a8fe7d27b90a01a64f1cd28e9ec">tvm::runtime::TVMRetValue</a>
</li>
<li>type()
: <a class="el" href="classtvm_1_1runtime_1_1vm_1_1Allocator.html#a7cfb6d4ea480436801276fe2e7660eb2">tvm::runtime::vm::Allocator</a>
@@ -1034,7 +1035,7 @@ $(function() {
: <a class="el" href="classtvm_1_1TypedEnvFunc_3_01R_07Args_8_8_8_08_4.html#a41a6b9014d0feeb628ca7edfd0d26f0b">tvm::TypedEnvFunc< R(Args...)></a>
</li>
<li>TypedPackedFunc()
-: <a class="el" href="classtvm_1_1runtime_1_1TypedPackedFunc_3_01R_07Args_8_8_8_08_4.html#a0161d426f9ca366c860ad48c384f7192">tvm::runtime::TypedPackedFunc< R(Args...)></a>
+: <a class="el" href="classtvm_1_1runtime_1_1TypedPackedFunc_3_01R_07Args_8_8_8_08_4.html#a4abadc6786dd14a3aed6e2b5b342d1d6">tvm::runtime::TypedPackedFunc< R(Args...)></a>
</li>
<li>TypeIndex2Key()
: <a class="el" href="classtvm_1_1runtime_1_1Object.html#a817ba6c23b7ee1821c48a75edf255a30">tvm::runtime::Object</a>
@@ -1057,7 +1058,7 @@ $(function() {
: <a class="el" href="classtvm_1_1TypeRelation.html#ac26b1897eab8197ed26606ab81b7403b">tvm::TypeRelation</a>
</li>
<li>TypeReporter()
-: <a class="el" href="classtvm_1_1TypeReporter.html#a8e7e05a07f9f7ad9bea91f27afac9051">tvm::TypeReporter</a>
+: <a class="el" href="classtvm_1_1TypeReporter.html#aa3dc38a3c84d324d0b3a9f358460a091">tvm::TypeReporter</a>
</li>
<li>TypeVar()
: <a class="el" href="classtvm_1_1TypeVar.html#adf5ef8e89d162735519b5d125c89e3e3">tvm::TypeVar</a>
diff --git a/docs/reference/api/doxygen/functions_t.html b/docs/reference/api/doxygen/functions_t.html
index ecdae1fb1..3d04d3142 100644
--- a/docs/reference/api/doxygen/functions_t.html
+++ b/docs/reference/api/doxygen/functions_t.html
@@ -942,6 +942,7 @@ $(function() {
, <a class="el" href="classtvm_1_1IRModule.html#ac3d7b217437ecefbd9096a57325ae29a">tvm::IRModule</a>
, <a class="el" href="classtvm_1_1relay_1_1Call.html#ab9eee004a05e13a319c9f1db05602754">tvm::relay::Call</a>
, <a class="el" href="classtvm_1_1relay_1_1Clause.html#a53074960bfc52dd8fbccfd543758f005">tvm::relay::Clause</a>
+, <a class="el" href="classtvm_1_1relay_1_1Constant.html#a0d1cb4aa284cd726e7efb9cacf061540">tvm::relay::Constant</a>
, <a class="el" href="classtvm_1_1relay_1_1Function.html#ac085d821f02ee1e2a4927f85f72f6862">tvm::relay::Function</a>
, <a class="el" href="classtvm_1_1relay_1_1FunctionPattern.html#ac9755e2e62689ab47e889db57433e338">tvm::relay::FunctionPattern</a>
, <a class="el" href="classtvm_1_1relay_1_1If.html#a75f8258495a590eeaa4dc161b650d2f7">tvm::relay::If</a>
@@ -1200,7 +1201,7 @@ $(function() {
, <a class="el" href="classtvm_1_1runtime_1_1ObjectPtr.html#ae0ea8b4adc6dab8c74086bceaef6b3e1">tvm::runtime::ObjectPtr< T ></a>
, <a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#ae0ea8b4adc6dab8c74086bceaef6b3e1">tvm::runtime::ObjectRef</a>
, <a class="el" href="classtvm_1_1runtime_1_1TVMPODValue__.html#ae0ea8b4adc6dab8c74086bceaef6b3e1">tvm::runtime::TVMPODValue_</a>
-, <a class="el" href="classtvm_1_1runtime_1_1TVMRetValue.html#ab86bf21f214fca72e73a7f6e20ffab8d">tvm::runtime::TVMRetValue</a>
+, <a class="el" href="classtvm_1_1runtime_1_1TVMRetValue.html#a77455a8fe7d27b90a01a64f1cd28e9ec">tvm::runtime::TVMRetValue</a>
, <a class="el" href="classtvm_1_1runtime_1_1TypedPackedFunc_3_01R_07Args_8_8_8_08_4.html#ae0ea8b4adc6dab8c74086bceaef6b3e1">tvm::runtime::TypedPackedFunc< R(Args...)></a>
</li>
<li>type
@@ -1272,7 +1273,7 @@ $(function() {
: <a class="el" href="classtvm_1_1TypedEnvFunc_3_01R_07Args_8_8_8_08_4.html#a0d72a6fa7263821c14bcd37837998ed9">tvm::TypedEnvFunc< R(Args...)></a>
</li>
<li>TypedPackedFunc()
-: <a class="el" href="classtvm_1_1runtime_1_1TypedPackedFunc_3_01R_07Args_8_8_8_08_4.html#a0161d426f9ca366c860ad48c384f7192">tvm::runtime::TypedPackedFunc< R(Args...)></a>
+: <a class="el" href="classtvm_1_1runtime_1_1TypedPackedFunc_3_01R_07Args_8_8_8_08_4.html#af45a2ceff92e6f6c394ea766a45027a0">tvm::runtime::TypedPackedFunc< R(Args...)></a>
</li>
<li>TypeIndex2Key()
: <a class="el" href="classtvm_1_1runtime_1_1Object.html#a817ba6c23b7ee1821c48a75edf255a30">tvm::runtime::Object</a>
@@ -1295,7 +1296,7 @@ $(function() {
: <a class="el" href="classtvm_1_1TypeRelation.html#ac26b1897eab8197ed26606ab81b7403b">tvm::TypeRelation</a>
</li>
<li>TypeReporter()
-: <a class="el" href="classtvm_1_1TypeReporter.html#a8e7e05a07f9f7ad9bea91f27afac9051">tvm::TypeReporter</a>
+: <a class="el" href="classtvm_1_1TypeReporter.html#aa3dc38a3c84d324d0b3a9f358460a091">tvm::TypeReporter</a>
</li>
<li>types
: <a class="el" href="classtvm_1_1TupleAffineTypeNode.html#a30c834b7e1cb64467e6587ac16ebb187">tvm::TupleAffineTypeNode</a>
diff --git a/docs/reference/api/doxygen/interpreter_8h_source.html b/docs/reference/api/doxygen/interpreter_8h_source.html
index 203c76dc9..833f26ced 100644
--- a/docs/reference/api/doxygen/interpreter_8h_source.html
+++ b/docs/reference/api/doxygen/interpreter_8h_source.html
@@ -110,7 +110,7 @@ $(function() {
<div class="ttc" id="classtvm_1_1runtime_1_1Map_html"><div class="ttname"><a href="classtvm_1_1runtime_1_1Map.html">tvm::runtime::Map</a></div><div class="ttdoc">Map container of NodeRef->NodeRef in DSL graph. Map implements copy on write semantics, which means map is mutable but copy will happen when array is referenced in more than two places. </div><div class="ttdef"><b>Definition:</b> map.h:1268</div></div>
<div class="ttc" id="classtvm_1_1relay_1_1RecClosureObj_html_a7a56c67a71f2d6d6621cdb0747b9dce0"><div class="ttname"><a href="classtvm_1_1relay_1_1RecClosureObj.html#a7a56c67a71f2d6d6621cdb0747b9dce0">tvm::relay::RecClosureObj::clos</a></div><div class="ttdeci">InterpreterClosure clos</div><div class="ttdoc">The closure. </div><div class="ttdef"><b>Definition:</b> interpreter.h:84</div></div>
<div class="ttc" id="classtvm_1_1relay_1_1RecClosureObj_html"><div class="ttname"><a href="classtvm_1_1relay_1_1RecClosureObj.html">tvm::relay::RecClosureObj</a></div><div class="ttdoc">The container type of RecClosure. </div><div class="ttdef"><b>Definition:</b> interpreter.h:81</div></div>
-<div class="ttc" id="classtvm_1_1relay_1_1Var_html"><div class="ttname"><a href="classtvm_1_1relay_1_1Var.html">tvm::relay::Var</a></div><div class="ttdef"><b>Definition:</b> expr.h:221</div></div>
+<div class="ttc" id="classtvm_1_1relay_1_1Var_html"><div class="ttname"><a href="classtvm_1_1relay_1_1Var.html">tvm::relay::Var</a></div><div class="ttdef"><b>Definition:</b> expr.h:234</div></div>
<div class="ttc" id="classtvm_1_1relay_1_1RecClosureObj_html_a352f57e3fda160d51855c046b877e244"><div class="ttname"><a href="classtvm_1_1relay_1_1RecClosureObj.html#a352f57e3fda160d51855c046b877e244">tvm::relay::RecClosureObj::VisitAttrs</a></div><div class="ttdeci">void VisitAttrs(tvm::AttrVisitor *v)</div><div class="ttdef"><b>Definition:</b> interpreter.h:90</div></div>
<div class="ttc" id="classtvm_1_1relay_1_1InterpreterClosureObj_html_a159f46324e99dc7ddd4ee8c78fa5cd3c"><div class="ttname"><a href="classtvm_1_1relay_1_1InterpreterClosureObj.html#a159f46324e99dc7ddd4ee8c78fa5cd3c">tvm::relay::InterpreterClosureObj::env</a></div><div class="ttdeci">tvm::Map< Var, ObjectRef > env</div><div class="ttdoc">The set of free variables in the closure. </div><div class="ttdef"><b>Definition:</b> interpreter.h:56</div></div>
<div class="ttc" id="namespacetvm_1_1topi_html_aaa95d3ad68932ab206efbe0a326db6a2"><div class="ttname"><a href="namespacetvm_1_1topi.html#aaa95d3ad68932ab206efbe0a326db6a2">tvm::topi::mod</a></div><div class="ttdeci">tvm::PrimExpr mod(const tvm::PrimExpr &a, const tvm::PrimExpr &b)</div><div class="ttdef"><b>Definition:</b> broadcast.h:290</div></div>
diff --git a/docs/reference/api/doxygen/ir_2expr_8h_source.html b/docs/reference/api/doxygen/ir_2expr_8h_source.html
index ed5e4e879..68a01907d 100644
--- a/docs/reference/api/doxygen/ir_2expr_8h_source.html
+++ b/docs/reference/api/doxygen/ir_2expr_8h_source.html
@@ -119,7 +119,7 @@ $(function() {
<div class="ttc" id="classtvm_1_1RangeNode_html_a53988be7b3181aa3b55eb991b615c48d"><div class="ttname"><a href="classtvm_1_1RangeNode.html#a53988be7b3181aa3b55eb991b615c48d">tvm::RangeNode::SEqualReduce</a></div><div class="ttdeci">bool SEqualReduce(const RangeNode *other, SEqualReducer equal) const</div><div class="ttdef"><b>Definition:</b> expr.h:481</div></div>
<div class="ttc" id="classtvm_1_1runtime_1_1DataType_html"><div class="ttname"><a href="classtvm_1_1runtime_1_1DataType.html">tvm::runtime::DataType</a></div><div class="ttdoc">Runtime primitive data type. </div><div class="ttdef"><b>Definition:</b> data_type.h:41</div></div>
<div class="ttc" id="classtvm_1_1BaseExprNode_html"><div class="ttname"><a href="classtvm_1_1BaseExprNode.html">tvm::BaseExprNode</a></div><div class="ttdoc">Base type of all the expressions. </div><div class="ttdef"><b>Definition:</b> expr.h:49</div></div>
-<div class="ttc" id="namespacetvm_1_1relay_html_a81ac7c3d0824529fddce7849c9c66289"><div class="ttname"><a href="namespacetvm_1_1relay.html#a81ac7c3d0824529fddce7849c9c66289">tvm::relay::GlobalVar</a></div><div class="ttdeci">tvm::GlobalVar GlobalVar</div><div class="ttdef"><b>Definition:</b> expr.h:48</div></div>
+<div class="ttc" id="namespacetvm_1_1relay_html_a81ac7c3d0824529fddce7849c9c66289"><div class="ttname"><a href="namespacetvm_1_1relay.html#a81ac7c3d0824529fddce7849c9c66289">tvm::relay::GlobalVar</a></div><div class="ttdeci">tvm::GlobalVar GlobalVar</div><div class="ttdef"><b>Definition:</b> expr.h:58</div></div>
<div class="ttc" id="structtvm_1_1runtime_1_1PackedFuncValueConverter_3_01PrimExpr_01_4_html_aa071662c3084d7ad3322351cb44c3dbf"><div class="ttname"><a href="structtvm_1_1runtime_1_1PackedFuncValueConverter_3_01PrimExpr_01_4.html#aa071662c3084d7ad3322351cb44c3dbf">tvm::runtime::PackedFuncValueConverter< PrimExpr >::From</a></div><div class="ttdeci">static PrimExpr From(const TVMPODValue_ &val)</div><div class="ttdef"><b>Definition:</b> expr.h:548</div></div>
<div class="ttc" id="classtvm_1_1BaseExprNode_html_a905dcf65204e877b6ccb977cf375f2a0"><div class="ttname"><a href="classtvm_1_1BaseExprNode.html#a905dcf65204e877b6ccb977cf375f2a0">tvm::BaseExprNode::_type_has_method_sequal_reduce</a></div><div class="ttdeci">static constexpr const bool _type_has_method_sequal_reduce</div><div class="ttdef"><b>Definition:</b> expr.h:58</div></div>
<div class="ttc" id="classtvm_1_1FloatImmNode_html_a74569b541c1056734fff07a23a05558e"><div class="ttname"><a href="classtvm_1_1FloatImmNode.html#a74569b541c1056734fff07a23a05558e">tvm::FloatImmNode::VisitAttrs</a></div><div class="ttdeci">void VisitAttrs(AttrVisitor *v)</div><div class="ttdef"><b>Definition:</b> expr.h:326</div></div>
diff --git a/docs/reference/api/doxygen/namespacemembers_func_w.html b/docs/reference/api/doxygen/namespacemembers_func_w.html
index 48f83d2f6..44755650c 100644
--- a/docs/reference/api/doxygen/namespacemembers_func_w.html
+++ b/docs/reference/api/doxygen/namespacemembers_func_w.html
@@ -74,7 +74,8 @@ $(function() {
: <a class="el" href="namespacetvm.html#aa01d3303b02caca566a093aa56fee692">tvm</a>
</li>
<li>WithFields()
-: <a class="el" href="namespacetvm_1_1relay.html#aaf3bb67945ee37070acbf4b3ef84d826">tvm::relay</a>
+: <a class="el" href="namespacetvm_1_1relay.html#ab877823176936e61bad173cba1d8052b">tvm::relay</a>
+, <a class="el" href="namespacetvm.html#a8eca58e10b2f3d0c8cc2da12c9d33c82">tvm</a>
</li>
<li>WithoutAttr()
: <a class="el" href="namespacetvm.html#a7e2bc626db8be997b1562c79df3d9e11">tvm</a>
diff --git a/docs/reference/api/doxygen/namespacemembers_w.html b/docs/reference/api/doxygen/namespacemembers_w.html
index 6828d9808..350633dc2 100644
--- a/docs/reference/api/doxygen/namespacemembers_w.html
+++ b/docs/reference/api/doxygen/namespacemembers_w.html
@@ -74,7 +74,8 @@ $(function() {
: <a class="el" href="namespacetvm.html#aa01d3303b02caca566a093aa56fee692">tvm</a>
</li>
<li>WithFields()
-: <a class="el" href="namespacetvm_1_1relay.html#aaf3bb67945ee37070acbf4b3ef84d826">tvm::relay</a>
+: <a class="el" href="namespacetvm_1_1relay.html#ab877823176936e61bad173cba1d8052b">tvm::relay</a>
+, <a class="el" href="namespacetvm.html#a8eca58e10b2f3d0c8cc2da12c9d33c82">tvm</a>
</li>
<li>WithoutAttr()
: <a class="el" href="namespacetvm.html#a7e2bc626db8be997b1562c79df3d9e11">tvm</a>
diff --git a/docs/reference/api/doxygen/namespacetvm.html b/docs/reference/api/doxygen/namespacetvm.html
index b582d9f22..0cf082510 100644
--- a/docs/reference/api/doxygen/namespacetvm.html
+++ b/docs/reference/api/doxygen/namespacetvm.html
@@ -651,6 +651,9 @@ Functions</h2></td></tr>
<tr class="memitem:afa0a9bdf3997ef4fad45b19fb1a655cd"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">runtime::ObjectRef</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm.html#afa0a9bdf3997ef4fad45b19fb1a655cd">LoadJSON</a> (std::string json_str)</td></tr>
<tr class="memdesc:afa0a9bdf3997ef4fad45b19fb1a655cd"><td class="mdescLeft"> </td><td class="mdescRight">Internal implementation of LoadJSON Load tvm Node object from json and return a shared_ptr of Node. <a href="#afa0a9bdf3997ef4fad45b19fb1a655cd">More...</a><br /></td></tr>
<tr class="separator:afa0a9bdf3997ef4fad45b19fb1a655cd"><td class="memSeparator" colspan="2"> </td></tr>
+<tr class="memitem:a8eca58e10b2f3d0c8cc2da12c9d33c82"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1GlobalVar.html">GlobalVar</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm.html#a8eca58e10b2f3d0c8cc2da12c9d33c82">WithFields</a> (<a class="el" href="classtvm_1_1GlobalVar.html">GlobalVar</a> global_var, <a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="classtvm_1_1runt [...]
+<tr class="memdesc:a8eca58e10b2f3d0c8cc2da12c9d33c82"><td class="mdescLeft"> </td><td class="mdescRight">Returns <code>global_var</code> with the given properties. A null property denotes 'no change'. Returns <code>global_var</code> if all properties are unchanged. Otherwise, returns a copy with the new fields. <a href="#a8eca58e10b2f3d0c8cc2da12c9d33c82">More...</a><br /></td></tr>
+<tr class="separator:a8eca58e10b2f3d0c8cc2da12c9d33c82"><td class="memSeparator" colspan="2"> </td></tr>
<tr class="memitem:a741dec82c75bea850290cf8bc412c006"><td class="memItemLeft" align="right" valign="top">void </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm.html#a741dec82c75bea850290cf8bc412c006">CheckAndUpdateHostConsistency</a> (<a class="el" href="classtvm_1_1Target.html">Target</a> *target, <a class="el" href="classtvm_1_1Target.html">Target</a> *host)</td></tr>
<tr class="memdesc:a741dec82c75bea850290cf8bc412c006"><td class="mdescLeft"> </td><td class="mdescRight">Check and update host field of the given legacy target and target host pair. Note that this function is for legacy target api compatibility issue only, not recommended for other use. <a href="#a741dec82c75bea850290cf8bc412c006">More...</a><br /></td></tr>
<tr class="separator:a741dec82c75bea850290cf8bc412c006"><td class="memSeparator" colspan="2"> </td></tr>
@@ -2460,8 +2463,8 @@ template<typename TAttrs > </div>
</div>
</div>
-<a id="a3bbd28c926db4678937fe228f77451d2"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a3bbd28c926db4678937fe228f77451d2">◆ </a></span>bitwise_or() <span class="overload">[2/3]</span></h2>
+<a id="a4fa04a2f38468bc444468209b0f38188"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a4fa04a2f38468bc444468209b0f38188">◆ </a></span>bitwise_or() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -2472,13 +2475,13 @@ template<typename TAttrs > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::bitwise_or </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -2502,8 +2505,8 @@ template<typename TAttrs > </div>
</div>
</div>
-<a id="a4fa04a2f38468bc444468209b0f38188"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a4fa04a2f38468bc444468209b0f38188">◆ </a></span>bitwise_or() <span class="overload">[3/3]</span></h2>
+<a id="a3bbd28c926db4678937fe228f77451d2"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a3bbd28c926db4678937fe228f77451d2">◆ </a></span>bitwise_or() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -2514,13 +2517,13 @@ template<typename TAttrs > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::bitwise_or </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -3361,8 +3364,8 @@ template<typename TAttrs > </div>
</div>
</div>
-<a id="a9d90476be7c95b4c40fe267fd2af5603"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a9d90476be7c95b4c40fe267fd2af5603">◆ </a></span>div() <span class="overload">[5/6]</span></h2>
+<a id="a421c6836f0e87cd662320a8f6c23d452"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a421c6836f0e87cd662320a8f6c23d452">◆ </a></span>div() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -3379,7 +3382,7 @@ template<typename TAttrs > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -3403,8 +3406,8 @@ template<typename TAttrs > </div>
</div>
</div>
-<a id="a421c6836f0e87cd662320a8f6c23d452"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a421c6836f0e87cd662320a8f6c23d452">◆ </a></span>div() <span class="overload">[6/6]</span></h2>
+<a id="a9d90476be7c95b4c40fe267fd2af5603"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a9d90476be7c95b4c40fe267fd2af5603">◆ </a></span>div() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -3421,7 +3424,7 @@ template<typename TAttrs > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -3808,8 +3811,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a435a9df348bdb72e60bfe4ce410dcc58"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a435a9df348bdb72e60bfe4ce410dcc58">◆ </a></span>floordiv() <span class="overload">[2/3]</span></h2>
+<a id="a87200564215339b900ca546678fc71a4"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a87200564215339b900ca546678fc71a4">◆ </a></span>floordiv() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -3820,13 +3823,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::floordiv </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -3850,8 +3853,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a87200564215339b900ca546678fc71a4"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a87200564215339b900ca546678fc71a4">◆ </a></span>floordiv() <span class="overload">[3/3]</span></h2>
+<a id="a435a9df348bdb72e60bfe4ce410dcc58"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a435a9df348bdb72e60bfe4ce410dcc58">◆ </a></span>floordiv() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -3862,13 +3865,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::floordiv </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -4436,8 +4439,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="aa7f616193a71c13d01ce3a3fab469f9d"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aa7f616193a71c13d01ce3a3fab469f9d">◆ </a></span>greater_equal() <span class="overload">[2/6]</span></h2>
+<a id="a78f0e420a400bb38907d21cfeaab8e18"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a78f0e420a400bb38907d21cfeaab8e18">◆ </a></span>greater_equal() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -4448,13 +4451,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::greater_equal </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -4604,8 +4607,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a78f0e420a400bb38907d21cfeaab8e18"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a78f0e420a400bb38907d21cfeaab8e18">◆ </a></span>greater_equal() <span class="overload">[6/6]</span></h2>
+<a id="aa7f616193a71c13d01ce3a3fab469f9d"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aa7f616193a71c13d01ce3a3fab469f9d">◆ </a></span>greater_equal() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -4616,13 +4619,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::greater_equal </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5398,8 +5401,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a6bc108896d74f5f3b5cc3b98e9780e1c"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6bc108896d74f5f3b5cc3b98e9780e1c">◆ </a></span>left_shift() <span class="overload">[2/3]</span></h2>
+<a id="a58fbf68a58a7f32935d6c4539d292a08"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a58fbf68a58a7f32935d6c4539d292a08">◆ </a></span>left_shift() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5410,13 +5413,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::left_shift </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5440,8 +5443,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a58fbf68a58a7f32935d6c4539d292a08"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a58fbf68a58a7f32935d6c4539d292a08">◆ </a></span>left_shift() <span class="overload">[3/3]</span></h2>
+<a id="a6bc108896d74f5f3b5cc3b98e9780e1c"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6bc108896d74f5f3b5cc3b98e9780e1c">◆ </a></span>left_shift() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5452,13 +5455,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::left_shift </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5528,8 +5531,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="ae29f761564dc96582c113b69b3d93aaa"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ae29f761564dc96582c113b69b3d93aaa">◆ </a></span>less() <span class="overload">[2/6]</span></h2>
+<a id="a9bbb69dc3563e07d5f81c003a7ad9aed"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a9bbb69dc3563e07d5f81c003a7ad9aed">◆ </a></span>less() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5546,7 +5549,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5612,8 +5615,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a0d5dc442dde0e69657c40803d394ea73"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a0d5dc442dde0e69657c40803d394ea73">◆ </a></span>less() <span class="overload">[4/6]</span></h2>
+<a id="a042ab9c595a8a0f63c07dbbdf75ecf9c"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a042ab9c595a8a0f63c07dbbdf75ecf9c">◆ </a></span>less() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5624,7 +5627,7 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::less </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -5654,8 +5657,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a9bbb69dc3563e07d5f81c003a7ad9aed"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a9bbb69dc3563e07d5f81c003a7ad9aed">◆ </a></span>less() <span class="overload">[5/6]</span></h2>
+<a id="ae29f761564dc96582c113b69b3d93aaa"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ae29f761564dc96582c113b69b3d93aaa">◆ </a></span>less() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5672,7 +5675,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5696,8 +5699,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a042ab9c595a8a0f63c07dbbdf75ecf9c"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a042ab9c595a8a0f63c07dbbdf75ecf9c">◆ </a></span>less() <span class="overload">[6/6]</span></h2>
+<a id="a0d5dc442dde0e69657c40803d394ea73"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a0d5dc442dde0e69657c40803d394ea73">◆ </a></span>less() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5708,7 +5711,7 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::less </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -5784,8 +5787,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a59f1a9bebe7948e2570b8c01386253d4"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a59f1a9bebe7948e2570b8c01386253d4">◆ </a></span>less_equal() <span class="overload">[2/6]</span></h2>
+<a id="a0626fa40cd72920c914ab38a6546a332"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a0626fa40cd72920c914ab38a6546a332">◆ </a></span>less_equal() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5802,7 +5805,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5826,8 +5829,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a0626fa40cd72920c914ab38a6546a332"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a0626fa40cd72920c914ab38a6546a332">◆ </a></span>less_equal() <span class="overload">[3/6]</span></h2>
+<a id="aec0ac319177760ff01be833bae8b72bf"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aec0ac319177760ff01be833bae8b72bf">◆ </a></span>less_equal() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5844,7 +5847,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5868,8 +5871,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="aec0ac319177760ff01be833bae8b72bf"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aec0ac319177760ff01be833bae8b72bf">◆ </a></span>less_equal() <span class="overload">[4/6]</span></h2>
+<a id="a5cee73ced0a40ed261dc3beec9f8247c"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a5cee73ced0a40ed261dc3beec9f8247c">◆ </a></span>less_equal() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5880,13 +5883,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::less_equal </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -5910,8 +5913,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a5cee73ced0a40ed261dc3beec9f8247c"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a5cee73ced0a40ed261dc3beec9f8247c">◆ </a></span>less_equal() <span class="overload">[5/6]</span></h2>
+<a id="a59f1a9bebe7948e2570b8c01386253d4"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a59f1a9bebe7948e2570b8c01386253d4">◆ </a></span>less_equal() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -5922,13 +5925,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::less_equal </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -6891,8 +6894,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="aa1eb9772f6fbb245cda93c5bd9e53e7d"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aa1eb9772f6fbb245cda93c5bd9e53e7d">◆ </a></span>max() <span class="overload">[6/7]</span></h2>
+<a id="aa22a313c142a61845ded7fdf77af7046"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aa22a313c142a61845ded7fdf77af7046">◆ </a></span>max() <span class="overload">[6/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -6909,7 +6912,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -6933,8 +6936,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="aa22a313c142a61845ded7fdf77af7046"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aa22a313c142a61845ded7fdf77af7046">◆ </a></span>max() <span class="overload">[7/7]</span></h2>
+<a id="aa1eb9772f6fbb245cda93c5bd9e53e7d"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aa1eb9772f6fbb245cda93c5bd9e53e7d">◆ </a></span>max() <span class="overload">[7/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -6951,7 +6954,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7193,8 +7196,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="acfa7fdecbf7391561b96ab5ad4ef21ed"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#acfa7fdecbf7391561b96ab5ad4ef21ed">◆ </a></span>min() <span class="overload">[5/7]</span></h2>
+<a id="a256aa9e1b6ed1c8dbadc529aae0eddc3"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a256aa9e1b6ed1c8dbadc529aae0eddc3">◆ </a></span>min() <span class="overload">[5/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7205,13 +7208,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::min </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7277,8 +7280,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a256aa9e1b6ed1c8dbadc529aae0eddc3"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a256aa9e1b6ed1c8dbadc529aae0eddc3">◆ </a></span>min() <span class="overload">[7/7]</span></h2>
+<a id="acfa7fdecbf7391561b96ab5ad4ef21ed"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#acfa7fdecbf7391561b96ab5ad4ef21ed">◆ </a></span>min() <span class="overload">[7/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7289,13 +7292,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::min </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7439,8 +7442,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="add1522db4005299ee5d3ee0c7460466d"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#add1522db4005299ee5d3ee0c7460466d">◆ </a></span>mul() <span class="overload">[2/6]</span></h2>
+<a id="af64e20cff6f9f74660c8068469f146b7"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af64e20cff6f9f74660c8068469f146b7">◆ </a></span>mul() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7451,13 +7454,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::mul </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7481,8 +7484,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a92b9c69c93190d9057dd6f73ff93797a"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a92b9c69c93190d9057dd6f73ff93797a">◆ </a></span>mul() <span class="overload">[3/6]</span></h2>
+<a id="a7a63f5958a0158fc2707f8888bcf13f2"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a7a63f5958a0158fc2707f8888bcf13f2">◆ </a></span>mul() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7499,7 +7502,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7523,8 +7526,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="a7a63f5958a0158fc2707f8888bcf13f2"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a7a63f5958a0158fc2707f8888bcf13f2">◆ </a></span>mul() <span class="overload">[4/6]</span></h2>
+<a id="a92b9c69c93190d9057dd6f73ff93797a"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a92b9c69c93190d9057dd6f73ff93797a">◆ </a></span>mul() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7541,7 +7544,7 @@ template<typename TA > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -7565,8 +7568,8 @@ template<typename TA > </div>
</div>
</div>
-<a id="af64e20cff6f9f74660c8068469f146b7"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af64e20cff6f9f74660c8068469f146b7">◆ </a></span>mul() <span class="overload">[5/6]</span></h2>
+<a id="add1522db4005299ee5d3ee0c7460466d"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#add1522db4005299ee5d3ee0c7460466d">◆ </a></span>mul() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -7577,13 +7580,13 @@ template<typename TA > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::mul </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -8135,8 +8138,8 @@ template<> </div>
</div>
</div>
-<a id="a1975d7ce2d2cbaef575fef1198550f43"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a1975d7ce2d2cbaef575fef1198550f43">◆ </a></span>operator &&() <span class="overload">[5/6]</span></h2>
+<a id="a453fac64e53716a977f867cb9665fde9"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a453fac64e53716a977f867cb9665fde9">◆ </a></span>operator &&() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8147,13 +8150,13 @@ template<> </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator&& </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">bool </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">bool </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8171,8 +8174,8 @@ template<> </div>
</div>
</div>
-<a id="a453fac64e53716a977f867cb9665fde9"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a453fac64e53716a977f867cb9665fde9">◆ </a></span>operator &&() <span class="overload">[6/6]</span></h2>
+<a id="a1975d7ce2d2cbaef575fef1198550f43"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a1975d7ce2d2cbaef575fef1198550f43">◆ </a></span>operator &&() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8183,13 +8186,13 @@ template<> </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator&& </td>
<td>(</td>
- <td class="paramtype">bool </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">bool </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8351,8 +8354,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="af70bb4a982810d795dbd17ce73c6b124"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af70bb4a982810d795dbd17ce73c6b124">◆ </a></span>operator*() <span class="overload">[2/6]</span></h2>
+<a id="a1815d8b152819885a5733554f374a9ca"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a1815d8b152819885a5733554f374a9ca">◆ </a></span>operator*() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8369,7 +8372,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8387,8 +8390,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="ace5dbde3bde1ba48d14a3f9064a45aee"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ace5dbde3bde1ba48d14a3f9064a45aee">◆ </a></span>operator*() <span class="overload">[3/6]</span></h2>
+<a id="af70bb4a982810d795dbd17ce73c6b124"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af70bb4a982810d795dbd17ce73c6b124">◆ </a></span>operator*() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8399,13 +8402,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator* </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8423,8 +8426,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="aca621e1d2df8562819bc021c1410b741"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aca621e1d2df8562819bc021c1410b741">◆ </a></span>operator*() <span class="overload">[4/6]</span></h2>
+<a id="ace5dbde3bde1ba48d14a3f9064a45aee"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ace5dbde3bde1ba48d14a3f9064a45aee">◆ </a></span>operator*() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8435,7 +8438,7 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator* </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -8459,8 +8462,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a6823188ec16be854223bbffe349c975d"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6823188ec16be854223bbffe349c975d">◆ </a></span>operator*() <span class="overload">[5/6]</span></h2>
+<a id="aca621e1d2df8562819bc021c1410b741"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aca621e1d2df8562819bc021c1410b741">◆ </a></span>operator*() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8471,13 +8474,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator* </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8495,8 +8498,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a1815d8b152819885a5733554f374a9ca"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a1815d8b152819885a5733554f374a9ca">◆ </a></span>operator*() <span class="overload">[6/6]</span></h2>
+<a id="a6823188ec16be854223bbffe349c975d"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6823188ec16be854223bbffe349c975d">◆ </a></span>operator*() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8513,7 +8516,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8678,8 +8681,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="ad728a6c2c3d21242a4df808aadb722eb"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ad728a6c2c3d21242a4df808aadb722eb">◆ </a></span>operator+() <span class="overload">[4/6]</span></h2>
+<a id="a50bfde26f015ed64e1c0341dd65d3fad"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a50bfde26f015ed64e1c0341dd65d3fad">◆ </a></span>operator+() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8690,13 +8693,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator+ </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8750,8 +8753,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a50bfde26f015ed64e1c0341dd65d3fad"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a50bfde26f015ed64e1c0341dd65d3fad">◆ </a></span>operator+() <span class="overload">[6/6]</span></h2>
+<a id="ad728a6c2c3d21242a4df808aadb722eb"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ad728a6c2c3d21242a4df808aadb722eb">◆ </a></span>operator+() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8762,13 +8765,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator+ </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8889,8 +8892,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a028ba217f99b6cb1592a6a56b2bc9ee5"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a028ba217f99b6cb1592a6a56b2bc9ee5">◆ </a></span>operator-() <span class="overload">[3/7]</span></h2>
+<a id="a679ff94dec26779d8769231abb229647"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a679ff94dec26779d8769231abb229647">◆ </a></span>operator-() <span class="overload">[3/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8907,7 +8910,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8925,8 +8928,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a679ff94dec26779d8769231abb229647"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a679ff94dec26779d8769231abb229647">◆ </a></span>operator-() <span class="overload">[4/7]</span></h2>
+<a id="a4f40ad3340a853d58664bc864dc10d47"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a4f40ad3340a853d58664bc864dc10d47">◆ </a></span>operator-() <span class="overload">[4/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8937,13 +8940,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator- </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8961,8 +8964,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="aef861fe5325bc0b415a905a24c42f10a"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aef861fe5325bc0b415a905a24c42f10a">◆ </a></span>operator-() <span class="overload">[5/7]</span></h2>
+<a id="af7c46ff33a2727f48b10d7d563f4a746"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af7c46ff33a2727f48b10d7d563f4a746">◆ </a></span>operator-() <span class="overload">[5/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -8973,13 +8976,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator- </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -8997,8 +9000,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="af7c46ff33a2727f48b10d7d563f4a746"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af7c46ff33a2727f48b10d7d563f4a746">◆ </a></span>operator-() <span class="overload">[6/7]</span></h2>
+<a id="a028ba217f99b6cb1592a6a56b2bc9ee5"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a028ba217f99b6cb1592a6a56b2bc9ee5">◆ </a></span>operator-() <span class="overload">[6/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9015,7 +9018,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9033,8 +9036,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a4f40ad3340a853d58664bc864dc10d47"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a4f40ad3340a853d58664bc864dc10d47">◆ </a></span>operator-() <span class="overload">[7/7]</span></h2>
+<a id="aef861fe5325bc0b415a905a24c42f10a"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aef861fe5325bc0b415a905a24c42f10a">◆ </a></span>operator-() <span class="overload">[7/7]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9045,7 +9048,7 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator- </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -9259,8 +9262,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a0854363590c38f5479b1da5e70c4f002"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a0854363590c38f5479b1da5e70c4f002">◆ </a></span>operator<() <span class="overload">[2/6]</span></h2>
+<a id="a4c5092e248ab7daa5de5c22717670d8e"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a4c5092e248ab7daa5de5c22717670d8e">◆ </a></span>operator<() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9277,7 +9280,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9295,8 +9298,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a4c5092e248ab7daa5de5c22717670d8e"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a4c5092e248ab7daa5de5c22717670d8e">◆ </a></span>operator<() <span class="overload">[3/6]</span></h2>
+<a id="a46877235265ab97544ec2e561f521b0f"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a46877235265ab97544ec2e561f521b0f">◆ </a></span>operator<() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9307,13 +9310,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator< </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9331,8 +9334,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="abc5d3aba4f3f15098d5ac2fb0c3dfd39"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#abc5d3aba4f3f15098d5ac2fb0c3dfd39">◆ </a></span>operator<() <span class="overload">[4/6]</span></h2>
+<a id="aa672271dbd566a0e7b9e4c87664bccb4"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aa672271dbd566a0e7b9e4c87664bccb4">◆ </a></span>operator<() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9343,13 +9346,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator< </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9367,8 +9370,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="aa672271dbd566a0e7b9e4c87664bccb4"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aa672271dbd566a0e7b9e4c87664bccb4">◆ </a></span>operator<() <span class="overload">[5/6]</span></h2>
+<a id="a0854363590c38f5479b1da5e70c4f002"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a0854363590c38f5479b1da5e70c4f002">◆ </a></span>operator<() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9385,7 +9388,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9403,8 +9406,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a46877235265ab97544ec2e561f521b0f"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a46877235265ab97544ec2e561f521b0f">◆ </a></span>operator<() <span class="overload">[6/6]</span></h2>
+<a id="abc5d3aba4f3f15098d5ac2fb0c3dfd39"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#abc5d3aba4f3f15098d5ac2fb0c3dfd39">◆ </a></span>operator<() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9415,7 +9418,7 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator< </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -9478,8 +9481,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a9d8412e5f401f59f5ca85ed556d70810"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a9d8412e5f401f59f5ca85ed556d70810">◆ </a></span>operator<<() <span class="overload">[2/3]</span></h2>
+<a id="ad0449d28f23318cc5163159a58c80ba3"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ad0449d28f23318cc5163159a58c80ba3">◆ </a></span>operator<<() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9490,13 +9493,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator<< </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9514,8 +9517,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="ad0449d28f23318cc5163159a58c80ba3"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ad0449d28f23318cc5163159a58c80ba3">◆ </a></span>operator<<() <span class="overload">[3/3]</span></h2>
+<a id="a9d8412e5f401f59f5ca85ed556d70810"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a9d8412e5f401f59f5ca85ed556d70810">◆ </a></span>operator<<() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9526,13 +9529,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator<< </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9625,8 +9628,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a06d97bd5ee2c12e8547be0cc42f6b300"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a06d97bd5ee2c12e8547be0cc42f6b300">◆ </a></span>operator<=() <span class="overload">[3/6]</span></h2>
+<a id="a6eea8276bcc178425bc14f3d878970ff"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6eea8276bcc178425bc14f3d878970ff">◆ </a></span>operator<=() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9643,7 +9646,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9661,8 +9664,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a6eea8276bcc178425bc14f3d878970ff"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6eea8276bcc178425bc14f3d878970ff">◆ </a></span>operator<=() <span class="overload">[4/6]</span></h2>
+<a id="a872f50bd7175eccf440865311aa75232"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a872f50bd7175eccf440865311aa75232">◆ </a></span>operator<=() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9673,13 +9676,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator<= </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9697,8 +9700,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a872f50bd7175eccf440865311aa75232"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a872f50bd7175eccf440865311aa75232">◆ </a></span>operator<=() <span class="overload">[5/6]</span></h2>
+<a id="ad5dbec0c48b8644c5c6e9d773ddc106b"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ad5dbec0c48b8644c5c6e9d773ddc106b">◆ </a></span>operator<=() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9709,7 +9712,7 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator<= </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -9733,8 +9736,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="ad5dbec0c48b8644c5c6e9d773ddc106b"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ad5dbec0c48b8644c5c6e9d773ddc106b">◆ </a></span>operator<=() <span class="overload">[6/6]</span></h2>
+<a id="a06d97bd5ee2c12e8547be0cc42f6b300"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a06d97bd5ee2c12e8547be0cc42f6b300">◆ </a></span>operator<=() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9745,13 +9748,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator<= </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9955,8 +9958,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="acc92dcd3d81981e983ddf05347bc9371"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#acc92dcd3d81981e983ddf05347bc9371">◆ </a></span>operator>() <span class="overload">[2/6]</span></h2>
+<a id="a6aeb6ed068c5de8ab908ff234337aeeb"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6aeb6ed068c5de8ab908ff234337aeeb">◆ </a></span>operator>() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -9967,13 +9970,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator> </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -9991,8 +9994,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a6d0ad14c882c11311836138a2c164cf3"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6d0ad14c882c11311836138a2c164cf3">◆ </a></span>operator>() <span class="overload">[3/6]</span></h2>
+<a id="a9cea8f3789d8f3dc78acae43e9a6aad6"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a9cea8f3789d8f3dc78acae43e9a6aad6">◆ </a></span>operator>() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10003,13 +10006,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator> </td>
<td>(</td>
- <td class="paramtype">float </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10027,8 +10030,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a6aeb6ed068c5de8ab908ff234337aeeb"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6aeb6ed068c5de8ab908ff234337aeeb">◆ </a></span>operator>() <span class="overload">[4/6]</span></h2>
+<a id="a6d0ad14c882c11311836138a2c164cf3"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6d0ad14c882c11311836138a2c164cf3">◆ </a></span>operator>() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10039,7 +10042,7 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator> </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
@@ -10063,8 +10066,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a9cea8f3789d8f3dc78acae43e9a6aad6"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a9cea8f3789d8f3dc78acae43e9a6aad6">◆ </a></span>operator>() <span class="overload">[5/6]</span></h2>
+<a id="a7e2181bca182f90533ec35537714d09d"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a7e2181bca182f90533ec35537714d09d">◆ </a></span>operator>() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10081,7 +10084,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10099,8 +10102,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a7e2181bca182f90533ec35537714d09d"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a7e2181bca182f90533ec35537714d09d">◆ </a></span>operator>() <span class="overload">[6/6]</span></h2>
+<a id="acc92dcd3d81981e983ddf05347bc9371"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#acc92dcd3d81981e983ddf05347bc9371">◆ </a></span>operator>() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10117,7 +10120,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10174,8 +10177,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a35961a6074b72fae0dfc48ee395e0673"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a35961a6074b72fae0dfc48ee395e0673">◆ </a></span>operator>=() <span class="overload">[2/6]</span></h2>
+<a id="aae1dcfef78728c5490d3c107b4abac5a"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#aae1dcfef78728c5490d3c107b4abac5a">◆ </a></span>operator>=() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10192,7 +10195,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">double </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10210,8 +10213,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="aae1dcfef78728c5490d3c107b4abac5a"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#aae1dcfef78728c5490d3c107b4abac5a">◆ </a></span>operator>=() <span class="overload">[3/6]</span></h2>
+<a id="ac194836fc11a8ba34e44738da17fd116"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ac194836fc11a8ba34e44738da17fd116">◆ </a></span>operator>=() <span class="overload">[3/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10228,7 +10231,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">double </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10246,8 +10249,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="af7dee311b945dfc5a821a119c1db9ad1"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af7dee311b945dfc5a821a119c1db9ad1">◆ </a></span>operator>=() <span class="overload">[4/6]</span></h2>
+<a id="a35961a6074b72fae0dfc48ee395e0673"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a35961a6074b72fae0dfc48ee395e0673">◆ </a></span>operator>=() <span class="overload">[4/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10258,13 +10261,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator>= </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10318,8 +10321,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="ac194836fc11a8ba34e44738da17fd116"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#ac194836fc11a8ba34e44738da17fd116">◆ </a></span>operator>=() <span class="overload">[6/6]</span></h2>
+<a id="af7dee311b945dfc5a821a119c1db9ad1"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af7dee311b945dfc5a821a119c1db9ad1">◆ </a></span>operator>=() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10330,13 +10333,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator>= </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10504,8 +10507,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a6f638564e5e4d1023096523800f2579e"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a6f638564e5e4d1023096523800f2579e">◆ </a></span>operator^() <span class="overload">[2/3]</span></h2>
+<a id="a82dc2fe21e7a64be5a1b11c2a8775d31"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a82dc2fe21e7a64be5a1b11c2a8775d31">◆ </a></span>operator^() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10516,13 +10519,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator^ </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10540,8 +10543,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a82dc2fe21e7a64be5a1b11c2a8775d31"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a82dc2fe21e7a64be5a1b11c2a8775d31">◆ </a></span>operator^() <span class="overload">[3/3]</span></h2>
+<a id="a6f638564e5e4d1023096523800f2579e"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a6f638564e5e4d1023096523800f2579e">◆ </a></span>operator^() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10552,13 +10555,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator^ </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10834,8 +10837,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a1a3f9ad4d0e25eee9c0b3a9c83114bc0"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a1a3f9ad4d0e25eee9c0b3a9c83114bc0">◆ </a></span>operator||() <span class="overload">[5/6]</span></h2>
+<a id="a873bb60c71f37cbb743e21797a53ba06"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a873bb60c71f37cbb743e21797a53ba06">◆ </a></span>operator||() <span class="overload">[5/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10846,13 +10849,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator|| </td>
<td>(</td>
- <td class="paramtype">bool </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">bool </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -10870,8 +10873,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a873bb60c71f37cbb743e21797a53ba06"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a873bb60c71f37cbb743e21797a53ba06">◆ </a></span>operator||() <span class="overload">[6/6]</span></h2>
+<a id="a1a3f9ad4d0e25eee9c0b3a9c83114bc0"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a1a3f9ad4d0e25eee9c0b3a9c83114bc0">◆ </a></span>operator||() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -10882,13 +10885,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::operator|| </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">bool </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">bool </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em> </td>
</tr>
<tr>
@@ -11287,8 +11290,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a98ff4361d0a24570f8dc32d03cde972a"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a98ff4361d0a24570f8dc32d03cde972a">◆ </a></span>right_shift() <span class="overload">[2/3]</span></h2>
+<a id="af49dde9dfdeea62e8ad3a6d8db53de0b"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af49dde9dfdeea62e8ad3a6d8db53de0b">◆ </a></span>right_shift() <span class="overload">[2/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -11299,13 +11302,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::right_shift </td>
<td>(</td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -11329,8 +11332,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="af49dde9dfdeea62e8ad3a6d8db53de0b"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af49dde9dfdeea62e8ad3a6d8db53de0b">◆ </a></span>right_shift() <span class="overload">[3/3]</span></h2>
+<a id="a98ff4361d0a24570f8dc32d03cde972a"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a98ff4361d0a24570f8dc32d03cde972a">◆ </a></span>right_shift() <span class="overload">[3/3]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -11341,13 +11344,13 @@ template<typename TB > </div>
<tr>
<td class="memname"><a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> tvm::right_shift </td>
<td>(</td>
- <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>a</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">const <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> & </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -11756,8 +11759,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="a7470d45dafa0a91b6c62b25cdd61514e"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a7470d45dafa0a91b6c62b25cdd61514e">◆ </a></span>sub() <span class="overload">[2/6]</span></h2>
+<a id="af2d75a528d344c6cfcf8b726a6abb7cc"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#af2d75a528d344c6cfcf8b726a6abb7cc">◆ </a></span>sub() <span class="overload">[2/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -11774,7 +11777,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">int </td>
+ <td class="paramtype">float </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -11924,8 +11927,8 @@ template<typename TB > </div>
</div>
</div>
-<a id="af2d75a528d344c6cfcf8b726a6abb7cc"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#af2d75a528d344c6cfcf8b726a6abb7cc">◆ </a></span>sub() <span class="overload">[6/6]</span></h2>
+<a id="a7470d45dafa0a91b6c62b25cdd61514e"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a7470d45dafa0a91b6c62b25cdd61514e">◆ </a></span>sub() <span class="overload">[6/6]</span></h2>
<div class="memitem">
<div class="memproto">
@@ -11942,7 +11945,7 @@ template<typename TB > </div>
<tr>
<td class="paramkey"></td>
<td></td>
- <td class="paramtype">float </td>
+ <td class="paramtype">int </td>
<td class="paramname"><em>b</em>, </td>
</tr>
<tr>
@@ -12549,6 +12552,54 @@ template<typename TFunc > </div>
</dl>
<dl class="section return"><dt>Returns</dt><dd>The new function or module with updated attributes. </dd></dl>
+</div>
+</div>
+<a id="a8eca58e10b2f3d0c8cc2da12c9d33c82"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#a8eca58e10b2f3d0c8cc2da12c9d33c82">◆ </a></span>WithFields()</h2>
+
+<div class="memitem">
+<div class="memproto">
+ <table class="memname">
+ <tr>
+ <td class="memname"><a class="el" href="classtvm_1_1GlobalVar.html">GlobalVar</a> tvm::WithFields </td>
+ <td>(</td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1GlobalVar.html">GlobalVar</a> </td>
+ <td class="paramname"><em>global_var</em>, </td>
+ </tr>
+ <tr>
+ <td class="paramkey"></td>
+ <td></td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="classtvm_1_1runtime_1_1String.html">String</a> > </td>
+ <td class="paramname"><em>opt_name_hint</em> = <code>{}</code>, </td>
+ </tr>
+ <tr>
+ <td class="paramkey"></td>
+ <td></td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="classtvm_1_1Type.html">Type</a> > </td>
+ <td class="paramname"><em>opt_type</em> = <code>{}</code>, </td>
+ </tr>
+ <tr>
+ <td class="paramkey"></td>
+ <td></td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="classtvm_1_1VirtualDevice.html">VirtualDevice</a> > </td>
+ <td class="paramname"><em>opt_virtual_device</em> = <code>{}</code>, </td>
+ </tr>
+ <tr>
+ <td class="paramkey"></td>
+ <td></td>
+ <td class="paramtype"><a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="classtvm_1_1Span.html">Span</a> > </td>
+ <td class="paramname"><em>opt_span</em> = <code>{}</code> </td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>)</td>
+ <td></td><td></td>
+ </tr>
+ </table>
+</div><div class="memdoc">
+
+<p>Returns <code>global_var</code> with the given properties. A null property denotes 'no change'. Returns <code>global_var</code> if all properties are unchanged. Otherwise, returns a copy with the new fields. </p>
+
</div>
</div>
<a id="a7e2bc626db8be997b1562c79df3d9e11"></a>
diff --git a/docs/reference/api/doxygen/namespacetvm_1_1relay.html b/docs/reference/api/doxygen/namespacetvm_1_1relay.html
index 49abf4e35..5d6f475b7 100644
--- a/docs/reference/api/doxygen/namespacetvm_1_1relay.html
+++ b/docs/reference/api/doxygen/namespacetvm_1_1relay.html
@@ -948,10 +948,10 @@ Enumerations</h2></td></tr>
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="func-members"></a>
Functions</h2></td></tr>
<tr class="memitem:acd80501d29e4d951be6746c79934a70c"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1relay_1_1Clause.html">Clause</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1relay.html#acd80501d29e4d951be6746c79934a70c">WithFields</a> (<a class="el" href="classtvm_1_1relay_1_1Clause.html">Clause</a> clause, <a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="class [...]
-<tr class="memdesc:acd80501d29e4d951be6746c79934a70c"><td class="mdescLeft"> </td><td class="mdescRight">Returns the clause with given properties. A null property denotes 'no change'. Returns clause if all properties are unchanged. Otherwise, returns a copy with the new fields. <a href="#acd80501d29e4d951be6746c79934a70c">More...</a><br /></td></tr>
+<tr class="memdesc:acd80501d29e4d951be6746c79934a70c"><td class="mdescLeft"> </td><td class="mdescRight">Returns <code>clause</code> with the given properties. A null property denotes 'no change'. Returns <code>clause</code> if all properties are unchanged. Otherwise, returns a copy with the new fields. <a href="#acd80501d29e4d951be6746c79934a70c">More...</a><br /></td></tr>
<tr class="separator:acd80501d29e4d951be6746c79934a70c"><td class="memSeparator" colspan="2"> </td></tr>
<tr class="memitem:adb39b46f86b66a5e7252f6d9102deb7b"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1relay_1_1Match.html">Match</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1relay.html#adb39b46f86b66a5e7252f6d9102deb7b">WithFields</a> (<a class="el" href="classtvm_1_1relay_1_1Match.html">Match</a> match, <a class="el" href="classtvm_1_1runtime_1_1Optional.html">Optional</a>< <a class="el" href="namespacet [...]
-<tr class="memdesc:adb39b46f86b66a5e7252f6d9102deb7b"><td class="mdescLeft"> </td><td class="mdescRight">Returns the match with given properties. A null property denotes 'no change'. Returns match if all properties are unchanged. Otherwise, returns a copy with the new fields. <a href="#adb39b46f86b66a5e7252f6d9102deb7b">More...</a><br /></td></tr>
+<tr class="memdesc:adb39b46f86b66a5e7252f6d9102deb7b"><td class="mdescLeft"> </td><td class="mdescRight">Returns <code>match</code> with the given properties. A null property denotes 'no change'. Returns <code>match</code> if all properties are unchanged. Otherwise, returns a copy with the new fields. <a href="#adb39b46f86b66a5e7252f6d9102deb7b">More...</a><br /></td></tr>
<tr class="separator:adb39b46f86b66a5e7252f6d9102deb7b"><td class="memSeparator" colspan="2"> </td></tr>
<tr class="memitem:a9c09d2d83aa356218069b1def8046ee7"><td class="memItemLeft" align="right" valign="top"><a class="el" href="namespacetvm.html#acd267f8d7f55da6ac681239831963279">Kind</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1relay.html#a9c09d2d83aa356218069b1def8046ee7">KindCheck</a> (const <a class="el" href="namespacetvm_1_1relay.html#a661d95f170bca230773914caeef3fe52">Type</a> &t, const <a class="el" href="classtvm_1_1IRModule.html" [...]
<tr class="memdesc:a9c09d2d83aa356218069b1def8046ee7"><td class="mdescLeft"> </td><td class="mdescRight">Check that types are well kinded by applying "kinding rules". <a href="#a9c09d2d83aa356218069b1def8046ee7">More...</a><br /></td></tr>
@@ -1046,32 +1046,35 @@ Functions</h2></td></tr>
<tr class="memitem:a48710b93ea41c2d528b042010bd12b7b"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1relay_1_1DFPattern.html">DFPattern</a> </td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1relay.html#a48710b93ea41c2d528b042010bd12b7b">IsTupleGetItem</a> (const <a class="el" href="classtvm_1_1relay_1_1DFPattern.html">DFPattern</a> tuple, int index=-1)</td></tr>
... 4105 lines suppressed ...