Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/12/11 18:10:26 UTC
[GitHub] [tvm] kk2049 opened a new issue #9715: [Bug] Te.gradient not work with complex forward workload
kk2049 opened a new issue #9715:
URL: https://github.com/apache/tvm/issues/9715
### My problem
I am trying to use the auto-scheduler to generate CUDA source code for the backward stage of an NCHW Winograd conv2d. Because of some bugs in topi.cuda.conv2d_winograd.winograd_cuda, I copied parts of that code to build my own workload.
The workload works without te.gradient, and I can successfully get source code for the forward stage. But once I add te.gradient, the workload no longer works and I get the error message below: `Check failed: (!repl_op.same_as(s->op)) is false: Cannot find Tensor(shape=[4, 2], op.name=A) in the inputs of compute(extracted_tensor.d.shared, ......`
I am really confused now. The fact that forward-stage codegen works suggests that my workload is at least partly correct, so I suspect this is caused by a bug in TVM, but I am not sure.
Could someone help me find out whether this is a TVM bug?
Thanks a lot!!!
### Expected behavior
The auto-scheduler should find a valid schedule for this code.
### Actual behavior
I get the error below when I start tuning.
```
Get devices for measurement successfully!
----------------------------------------------------------------------
------------------------------ [ Search ]
----------------------------------------------------------------------
Traceback (most recent call last):
File "bug_scheduler.py", line 189, in <module>
task.tune(tune_option)
File "/data/anaconda3/envs/env3.7/lib/python3.7/site-packages/tvm-0.8.0-py3.7-linux-x86_64.egg/tvm/auto_scheduler/search_task.py", line 498, in tune
_ffi_api.AutoSchedule(search_policy, tuning_options)
File "/data/anaconda3/envs/env3.7/lib/python3.7/site-packages/tvm-0.8.0-py3.7-linux-x86_64.egg/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
13: TVMFuncCall
12: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::Array<tvm::runtime::ObjectRef, void> (tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)>::AssignTypedLambda<tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)#3}>(tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)#3}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
11: tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)
10: tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)
9: tvm::auto_scheduler::SketchPolicyNode::SearchOneRound(int, tvm::runtime::Array<tvm::auto_scheduler::State, void>*)
8: tvm::auto_scheduler::SketchPolicyNode::GenerateSketches()
7: tvm::auto_scheduler::RuleAddCacheRead::Apply(tvm::auto_scheduler::SketchPolicyNode const&, tvm::auto_scheduler::State const&, int) const
6: tvm::auto_scheduler::State::cache_read(int, tvm::runtime::String const&, tvm::runtime::Array<tvm::Integer, void> const&, tvm::auto_scheduler::ComputeDAG const&)
5: tvm::auto_scheduler::CacheReadStepNode::ApplyToState(tvm::auto_scheduler::State*, tvm::auto_scheduler::ComputeDAG const&) const
4: tvm::auto_scheduler::ComputeDAG::ReplayAndGetDAG(tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&) const
3: tvm::auto_scheduler::ComputeDAG::ApplySteps(tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::runtime::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::auto_scheduler::LayoutRewriteOption) const
2: tvm::auto_scheduler::StepApplyToSchedule(tvm::auto_scheduler::Step const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::runtime::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::te::Schedule*, tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&)
1: tvm::auto_scheduler::CacheReadStepNode::ApplyToSchedule(tvm::runtime::Array<tvm::te::Stage, void>*, tvm::runtime::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::te::Schedule*) const
0: tvm::te::Schedule::cache_read(tvm::te::Tensor const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::Array<tvm::te::Operation, void> const&)
File "/data/apache-tvm-src-v0.8.0.rc0/src/te/schedule/schedule_dataflow_rewrite.cc", line 168
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
Check failed: (!repl_op.same_as(s->op)) is false: Cannot find Tensor(shape=[4, 2], op.name=A) in the inputs of compute(extracted_tensor.d.shared, body=[extracted_tensor[ax0, ax1, ax2, ax3]], axis=[iter_var(ax0, range(min=0, ext=2)), iter_var(ax1, range(min=0, ext=2)), iter_var(ax2, range(min=0, ext=4)), iter_var(ax3, range(min=0, ext=4))], reduce_axis=[], tag=, attrs={})
```
### Environment
My system is Ubuntu 16.04.
CUDA version is 10.2.
My TVM version is 0.8.0. I built it from the source code on the Download Apache TVM Source Code web page.
### Steps to reproduce
I am sorry for posting such long code, but I want to make sure the bug can be reproduced. I tried cutting parts of the code to get a smaller reproduction, but the bug is only triggered by this long version.
``` python
import os
import numpy as np
import tvm
from tvm import auto_scheduler
import logging
from tvm import te, topi
from tvm import autotvm
from tvm.topi import nn
from tvm.topi.utils import get_const_int, get_const_tuple, traverse_inline
from tvm.topi.nn.winograd_util import winograd_transform_matrices
from tvm.topi.nn.conv2d import conv2d_winograd_nhwc, _conv2d_winograd_nhwc_impl
import sys
import numpy as np
from tvm.topi.testing import conv2d_nchw_python


def _infer_tile_size(data, kernel, layout="NCHW"):
    if layout == "NCHW":
        N, CI, H, W = get_const_tuple(data.shape)
    else:
        assert layout == "NHWC"
        N, H, W, CI = get_const_tuple(data.shape)

    if H % 8 == 0:
        return 4
    return 2

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    # NOTE: the stride and padding arguments are overridden here.
    stride = (1, 1)
    padding = (1, 1)
    dilation = (1, 1)
    pre_computed = False
    out_dtype = "float32"

    tile_size = _infer_tile_size(data, kernel)
    N, CI, H, W = get_const_tuple(data.shape)

    if isinstance(N, tvm.tir.Any):
        N = tvm.te.size_var("n")

    if not isinstance(H, int) or not isinstance(W, int):
        raise RuntimeError(
            "cuda winograd conv2d doesn't support dynamic input height or width."
        )

    if isinstance(dilation, int):
        dilation_h = dilation_w = dilation
    else:
        dilation_h, dilation_w = dilation
    HSTR, WSTR = (stride, stride) if isinstance(stride, int) else stride

    if not pre_computed:  # kernel tensor is raw tensor, do strict check
        if dilation_h != 1 or dilation_w != 1:
            kernel = nn.dilate(kernel, (1, 1, dilation_h, dilation_w))
        CO, CI, KH, KW = get_const_tuple(kernel.shape)
        alpha = KW + tile_size - 1
        assert HSTR == 1 and WSTR == 1 and KH == KW
    else:
        # kernel tensor is pre-transformed. this op is created by alter op layout.
        # dilation is not supported
        alpha, _, CI, CO = get_const_tuple(kernel.shape)
        KH = KW = alpha + 1 - tile_size
        assert HSTR == 1 and WSTR == 1 and dilation_h == 1 and dilation_w == 1

    pt, pl, pb, pr = nn.get_pad_tuple(padding, (KH, KW))
    data_pad = nn.pad(data, (0, 0, pt, pl), (0, 0, pb, pr), name="data_pad")

    r = KW
    m = tile_size
    A, B, G = winograd_transform_matrices(m, r, out_dtype)

    H = (H + pt + pb - KH) // HSTR + 1
    W = (W + pl + pr - KW) // WSTR + 1
    nH, nW = (H + m - 1) // m, (W + m - 1) // m
    P = N * nH * nW if isinstance(N, int) else nH * nW

    # transform kernel
    if not pre_computed:
        r_kh = te.reduce_axis((0, KH), name="r_kh")
        r_kw = te.reduce_axis((0, KW), name="r_kw")
        kernel_pack = te.compute(
            (alpha, alpha, CI, CO),
            lambda eps, nu, ci, co: te.sum(
                kernel[co][ci][r_kh][r_kw] * G[eps][r_kh] * G[nu][r_kw], axis=[r_kh, r_kw]
            ),
            name="my_kernel_pack",
        )
    else:
        kernel_pack = kernel

    idxdiv = tvm.tir.indexdiv
    idxmod = tvm.tir.indexmod

    # pack input tile
    input_tile = te.compute(
        (CI, P, alpha, alpha),
        lambda c, p, eps_1, nu_1: data_pad[idxdiv(p, (nH * nW))][c][
            idxmod(idxdiv(p, nW), nH) * m + eps_1
        ][idxmod(p, nW) * m + nu_1],
        name="my_d",
    )

    # dy = tvm.te.placeholder(input_tile.shape, name="input2_dy")
    # [dw] = tvm.te.gradient(input_tile, [data], head=dy)
    # return [data, kernel, input_tile, dy, dw]

    # transform data
    r_a = te.reduce_axis((0, alpha), "r_a")
    r_b = te.reduce_axis((0, alpha), "r_b")
    data_pack = te.compute(
        (alpha, alpha, CI, P),
        lambda eps, nu, ci, p: te.sum(
            input_tile[ci][p][r_a][r_b] * B[r_a][eps] * B[r_b][nu], axis=[r_a, r_b]
        ),
        name="my_data_pack",
    )

    # dy = tvm.te.placeholder(data_pack.shape, name="input2_dy")
    # [dw] = tvm.te.gradient(data_pack, [data], head=dy)
    # return [data, kernel, data_pack, dy, dw]

    # do batch gemm
    ci = te.reduce_axis((0, CI), name="ci")
    bgemm = te.compute(
        (alpha, alpha, CO, P),
        lambda eps, nu, co, p: te.sum(
            kernel_pack[eps][nu][ci][co] * data_pack[eps][nu][ci][p], axis=[ci]
        ),
        name="my_bgemm",
    )

    # inverse transform
    r_a_2 = te.reduce_axis((0, alpha), "r_a_2")
    r_b_2 = te.reduce_axis((0, alpha), "r_b_2")
    inverse = te.compute(
        (CO, P, m, m),
        lambda co, p, vh, vw: te.sum(
            bgemm[r_a_2][r_b_2][co][p] * A[r_a_2][vh] * A[r_b_2][vw], axis=[r_a_2, r_b_2]
        ),
        name="my_inverse",
    )

    # output
    output = te.compute(
        (N, CO, H, W),
        lambda n, co, h, w: inverse[
            co, n * nH * nW + idxdiv(h, m) * nW + idxdiv(w, m), idxmod(h, m), idxmod(w, m)
        ],
        name="my_output",
        tag="conv2d_nchw_winograd",
    )

    # Backward stage: gradient of `output` with respect to `data`, seeded with
    # the upstream gradient `dy`. (Despite its name, `dw` is the data gradient.)
    dy = tvm.te.placeholder(output.shape, name="input2_dy")
    [dw] = tvm.te.gradient(output, [data], head=dy)
    return [data, kernel, output, dy, dw]
    # Forward-only variant, which tunes successfully:
    # return [data, kernel, output]


target = tvm.target.Target("cuda")

# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)

log_file = "conv2d.json"
if os.path.exists(log_file):
    os.remove(log_file)

measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

# Run auto-tuning (search)
task.tune(tune_option)

# Apply the best schedule
sch, args = task.apply_best(log_file)
```
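For readers less familiar with the stages in the workload above, here is a minimal NumPy sketch of what `my_kernel_pack`, `my_data_pack`, `my_bgemm` and `my_inverse` compute for a single 4x4 input tile and one 3x3 filter (`m = 2`, `r = 3`, `alpha = 4`, which is what `_infer_tile_size` selects for a 7x7 input). The matrices are the commonly published F(2x2, 3x3) ones and are an assumption here; `winograd_transform_matrices` may return differently scaled values. The helper names `winograd_tile` and `direct_tile` are hypothetical.

```python
import numpy as np

# Commonly published F(2x2, 3x3) transform matrices (assumed, see note above).
# They are stored in the same orientation the workload indexes them:
# B is alpha x alpha, G is alpha x r, A is alpha x m (4 x 2).
B = np.array([[1, 0, -1, 0],
              [0, 1, 1, 0],
              [0, -1, 1, 0],
              [0, 1, 0, -1]], dtype=np.float64).T
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A = np.array([[1, 1, 1, 0],
              [0, 1, -1, -1]], dtype=np.float64).T


def winograd_tile(d, g):
    """Winograd path for one 4x4 tile `d` and one 3x3 filter `g`."""
    U = G @ g @ G.T      # kernel transform, like my_kernel_pack
    V = B.T @ d @ B      # data transform, like my_data_pack
    M = U * V            # element-wise product, like my_bgemm without the sum over ci
    return A.T @ M @ A   # inverse transform, like my_inverse


def direct_tile(d, g):
    """Reference: direct 2x2 'valid' correlation of the same tile and filter."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
print(np.allclose(winograd_tile(d, g), direct_tile(d, g)))
```

Note that `A` has shape (4, 2) in this setting, the same shape as the tensor named in the error message.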
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [tvm] kk2049 commented on issue #9715: [Bug] Te.gradient not work with complex forward workload
Posted by GitBox <gi...@apache.org>.
kk2049 commented on issue #9715:
URL: https://github.com/apache/tvm/issues/9715#issuecomment-995867087
@comaniac Sorry to bother you. (I really appreciate your help with `te.gradient` months ago in #8991.) I wonder if I could get your help again with this problem. I am confused by this bug and have no idea how to fix it. Thanks a lot!!
[GitHub] [tvm] comaniac commented on issue #9715: [Bug] Te.gradient not work with complex forward workload
Posted by GitBox <gi...@apache.org>.
comaniac commented on issue #9715:
URL: https://github.com/apache/tvm/issues/9715#issuecomment-996055035
It looks like auto-scheduler has issues when generating the schedule sketch for this workload. You could first try to build and run this workload on CPU without tuning to see if we could identify the problem. If that doesn't work, then it must be something wrong with the workload or te.gradient. Otherwise, we could investigate the compute DAG to see why auto-scheduler failed to work on this workload generated by te.gradient.
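Once the workload builds and runs on CPU, one way to tell a wrong gradient apart from a scheduler failure is to compare the gradient output with a finite-difference estimate. The following is a minimal NumPy sketch of such a check, not TVM code: `forward` and `vjp` stand in for the compiled forward function and for the tensor produced by `te.gradient`, and both they and the helper `check_vjp` are hypothetical names. The toy `forward` (zero-pad, then scale) is only there to make the sketch runnable.

```python
import numpy as np


def check_vjp(forward, vjp, x, dy, eps=1e-6, seed=0):
    """Return (<dy, J v>, <vjp(x, dy), v>) for a random direction v.

    J v is estimated with central differences; the two numbers should agree
    if `vjp` really is the gradient of `forward` seeded with `dy`.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    jv = (forward(x + eps * v) - forward(x - eps * v)) / (2 * eps)
    lhs = np.sum(dy * jv)
    rhs = np.sum(vjp(x, dy) * v)
    return lhs, rhs


rng = np.random.default_rng(1)
S = rng.standard_normal((9, 9))  # fixed weights of the toy forward function


def forward(x):
    # Toy stand-in: zero-pad a 7x7 input to 9x9, then scale element-wise.
    return np.pad(x, 1) * S


def vjp(x, dy):
    # Hand-written gradient of the toy forward: scale, then crop the padding.
    return (dy * S)[1:-1, 1:-1]


x = rng.standard_normal((7, 7))
dy = rng.standard_normal((9, 9))
lhs, rhs = check_vjp(forward, vjp, x, dy)
print(abs(lhs - rhs))
```

If the two numbers disagree well beyond rounding error, the gradient itself is suspect; if they agree, the problem is more likely on the scheduling side.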
[GitHub] [tvm] kk2049 commented on issue #9715: [Bug] Te.gradient not work with complex forward workload
Posted by GitBox <gi...@apache.org>.
kk2049 commented on issue #9715:
URL: https://github.com/apache/tvm/issues/9715#issuecomment-997443436
@comaniac Thanks for your reply! I tried running this workload with `tvm.target.Target("llvm")`, and it launches successfully. So I selected `target("cuda")` again and printed the compute DAG. It looks like this:
```
Computational DAG:
kernel = PLACEHOLDER [512, 512, 3, 3]
G(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 3) == 2)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 3) == 1)), ..(OMITTED).. (floormod(i, 4) == 0) && (floormod(j, 3) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 3) == 0)), 1f, 0f))))))))))))
my_kernel_pack(eps, nu, ci, co) += ((kernel[co, ci, r_kh, r_kw]*G[eps, r_kh])*G[nu, r_kw])
data = PLACEHOLDER [1, 512, 7, 7]
data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
my_d(c, p, eps_1, nu_1) = data_pad[floordiv(p, 16), c, ((floormod(floordiv(p, 4), 4)*2) + eps_1), ((floormod(p, 4)*2) + nu_1)]
B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
my_data_pack(eps, nu, ci, p) += ((my_d[ci, p, r_a, r_b]*B[r_a, eps])*B[r_b, nu])
my_bgemm(eps, nu, co, p) += (my_kernel_pack[eps, nu, ci, co]*my_data_pack[eps, nu, ci, p])
A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
my_inverse(co, p, vh, vw) += ((my_bgemm[r_a_2, r_b_2, co, p]*A[r_a_2, vh])*A[r_b_2, vw])
my_output(n, co, h, w) = my_inverse[co, ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), floormod(h, 2), floormod(w, 2)]
input2_dy = PLACEHOLDER [1, 512, 7, 7]
my_output.my_inverse.grad(ax0, ax1, ax2, ax3) = select((((((((ax2*4) + (floordiv((7 + (ax1*-2)), 8)*-8)) <= 24) && (((ax1*-2) + ..(OMITTED).. ) <= 15)), input2_dy[0, ax0, (ax2 + (floordiv((7 + (ax1*-2)), 8)*-2)), (((floordiv((7 + (ax1*-2)), 8)*8) + (ax1*2)) + ax3)], 0f)
extracted_tensor(n0_n0_vh.shifted.shifted, n1_n1_vw.shifted.shifted, n2_n2_jac_i0.shifted.shifted, n3_n3_jac_i1.shifted.shifted) = (A[n2_n2_jac_i0.shifted.shifted, n0_n0_vh.shifted.shifted]*A[n3_n3_jac_i1.shifted.shifted, n1_n1_vw.shifted.shifted])
my_inverse.my_bgemm.grad(ax0, ax1, ax2, ax3) += (my_output.my_inverse.grad[ax2, ax3, n0_n0_k2.shifted.shifted, n1_n1_k3.shifted.shifted]*extracted_tensor[n0_n0_k2.shifted.shifted, n1_n1_k3.shifted.shifted, ax0, ax1])
my_bgemm.my_data_pack.grad(ax0, ax1, ax2, ax3) += (my_inverse.my_bgemm.grad[ax0, ax1, n0_n0_k2.shifted.shifted, ax3]*my_kernel_pack[ax0, ax1, ax2, n0_n0_k2.shifted.shifted])
extracted_tensor(n0_n0_eps.shifted.shifted, n1_n1_nu.shifted.shifted, n4_n4_jac_i2.shifted.shifted, n5_n5_jac_i3.shifted.shifted) = (B[n4_n4_jac_i2.shifted.shifted, n0_n0_eps.shifted.shifted]*B[n5_n5_jac_i3.shifted.shifted, n1_n1_nu.shifted.shifted])
my_data_pack.my_d.grad(ax0, ax1, ax2, ax3) += (my_bgemm.my_data_pack.grad[n0_n0_k0.shifted.shifted, n1_n1_k1.shifted.shifted, ax0, ax1]*extracted_tensor[n0_n0_k0.shifted.shifted, n1_n1_k1.shifted.shifted, ax2, ax3])
data_pad.data.grad(ax0, ax1, ax2, ax3) += my_data_pack.my_d.grad[ax1, (((((floordiv((ax2 + 1), 2) + n0_n0_fdiv1.shifted.shifted) ..(OMITTED).. ormod((ax2 + 1), 2) + (n0_n0_fdiv1.shifted.shifted*-2)) + 2), ((floormod((ax3 + 1), 2) + (n1_n1_fmod1.shifted.shifted*-2)) + 2)]
```
I tried to check this DAG myself but failed to find anything useful. Maybe you can find something in it?
Thanks a lot for your help!!!
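As a reading aid for the DAG above: the gradient stages mirror the forward stages in reverse, and the two stages printed as `extracted_tensor` are products of the transform matrices (A times A for the inverse transform, B times B for the data transform). Both are printed under the same name in this DAG. Below is a minimal NumPy sketch of that backward chain for a single tile, assuming the commonly published F(2x2, 3x3) matrices rather than the exact output of `winograd_transform_matrices`; the function names are hypothetical.

```python
import numpy as np

# Commonly published F(2x2, 3x3) matrices (an assumption, see note above).
B = np.array([[1, 0, -1, 0],
              [0, 1, 1, 0],
              [0, -1, 1, 0],
              [0, 1, 0, -1]], dtype=np.float64).T   # alpha x alpha
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])                     # alpha x r
A = np.array([[1, 1, 1, 0],
              [0, 1, -1, -1]], dtype=np.float64).T  # alpha x m


def backward_data(g, dy):
    """Gradient w.r.t. the 4x4 input tile, one forward stage at a time."""
    dM = A @ dy @ A.T          # through the inverse transform, cf. my_inverse.my_bgemm.grad
    dV = dM * (G @ g @ G.T)    # through the product with the packed kernel, cf. my_bgemm.my_data_pack.grad
    return B @ dV @ B.T        # through the data transform, cf. my_data_pack.my_d.grad


def backward_data_direct(g, dy):
    """Reference: gradient of the direct 2x2 'valid' correlation."""
    dd = np.zeros((4, 4))
    for i in range(2):
        for j in range(2):
            dd[i:i + 3, j:j + 3] += dy[i, j] * g
    return dd


rng = np.random.default_rng(0)
g = rng.standard_normal((3, 3))
dy = rng.standard_normal((2, 2))
print(np.allclose(backward_data(g, dy), backward_data_direct(g, dy)))
```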