Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/04/10 10:38:47 UTC
[GitHub] [incubator-tvm] roastduck opened a new issue #5303: [TIR] Buggy LoopPartition
URL: https://github.com/apache/incubator-tvm/issues/5303
The `LoopPartition` pass may generate incorrect code when some variables are unknown at compile time. Here is an example:
```python
import tvm
import topi
import numpy as np

dtype = "float32"
target = "cuda"
m = 32
n_max = 1024

n = tvm.te.placeholder((), name="n", dtype="int32")
a = tvm.te.placeholder((n_max,), name="a", dtype=dtype)

def f(i):
    j = tvm.te.reduce_axis((0, n), name="j")
    return tvm.te.sum(i * a[j], axis=[j])

b = tvm.te.compute((m,), f)

with tvm.target.cuda():
    s = tvm.te.create_schedule(b.op)
    blk_x = tvm.te.thread_axis("blockIdx.x")
    th_x = tvm.te.thread_axis("threadIdx.x")

    i, = b.op.axis
    j, = b.op.reduce_axis
    i_outer, i_inner = s[b].split(i, nparts=1)
    j_outer, j_inner = s[b].split(j, factor=m)
    s[b].reorder(i_outer, j_outer, i_inner, j_inner)
    s[b].bind(i_outer, blk_x)
    s[b].bind(i_inner, th_x)

    a_cache = s.cache_read(a, "shared", [b])
    s[a_cache].compute_at(s[b], j_outer)
    a_axis, = a_cache.op.axis
    a_axis, _ = s[a_cache].split(a_axis, factor=1)  # Workaround TVM Discuss Question 4826
    s[a_cache].bind(a_axis, th_x)

    print(tvm.lower(s, [n, a, b], target, simple_mode=True))
    compute = tvm.build(s, [n, a, b], target, name="run")
    print(compute.imported_modules[0].get_source())
```
In this example, we tile `a` with factor 32 and cache it in shared memory. The key characteristic of this example is that the reduction length `n` is unknown at compile time, so TVM has to emit boundary checks.
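As a side note, the reference semantics of the computation above can be checked in plain NumPy. The constants below are taken from the example; the variable names are local to this snippet:

```python
import numpy as np

# Reference semantics of the schedule: b[i] = sum over j in [0, n) of i * a[j],
# which simplifies to b[i] = i * sum(a[:n]). n <= n_max is a runtime value.
n, n_max, m = 100, 1024, 32
rng = np.random.default_rng(0)
a = rng.random(n_max).astype("float32")
b_ref = np.array([i * a[:n].sum() for i in range(m)], dtype="float32")
assert np.allclose(b_ref, np.arange(m, dtype="float32") * a[:n].sum())
```

This is only a semantic reference, not a re-implementation of the schedule; it is what the generated CUDA kernel should compute for any runtime `n`.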
We first print the IR, and then print the generated CUDA code. The output is as follows:
```
produce compute {
  // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
  // attr [a.shared] storage_scope = "shared"
  allocate a.shared[float32 * 32]
  // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
  compute[threadIdx.x] = 0f
  for (j.outer, 0, floordiv((n[0] + 31), 32)) {
    produce a.shared {
      // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
      if (likely((((j.outer*32) + threadIdx.x) < 1024))) { // LOOK AT HERE
        a.shared[threadIdx.x] = a[((j.outer*32) + threadIdx.x)]
      }
    }
    // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
    for (j.inner, 0, 32) {
      if (likely((((j.outer*32) + j.inner) < n[0]))) {
        compute[threadIdx.x] = (compute[threadIdx.x] + (float32(threadIdx.x)*a.shared[j.inner]))
      }
    }
  }
}
```
[18:24:47] /home/rd/src/incubator-tvm/src/tir/pass/loop_partition.cc:533: Cannot prove:((((floordiv((n[0] + 31), 32) - 1) - 32) + 1) >= 0), when generating the post doubt loop
```
extern "C" __global__ void run_kernel0(void* __restrict__ compute, void* __restrict__ n, void* __restrict__ a) {
  __shared__ float a_shared[32];
  (( float*)compute)[(((int)threadIdx.x))] = 0.000000e+00f;
  for (int j_outer = 0; j_outer < min(32, (((( int*)n)[(0)] + 31) >> 5)); ++j_outer) { // LOOK AT HERE
    __syncthreads();
    a_shared[(((int)threadIdx.x))] = (( float*)a)[(((j_outer * 32) + ((int)threadIdx.x)))]; // LOOK AT HERE
    __syncthreads();
    for (int j_inner = 0; j_inner < 32; ++j_inner) {
      if (((j_outer * 32) + j_inner) < (( int*)n)[(0)]) {
        (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner)]));
      }
    }
  }
  for (int j_outer1 = 0; j_outer1 < max(((((( int*)n)[(0)] + 31) >> 5) - 32), 0); ++j_outer1) {
    __syncthreads();
    if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)) < 1024) {
      a_shared[(((int)threadIdx.x))] = (( float*)a)[((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)))];
    }
    __syncthreads();
    for (int j_inner1 = 0; j_inner1 < 32; ++j_inner1) {
      if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + j_inner1) < (( int*)n)[(0)]) {
        (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner1)]));
      }
    }
  }
}
```
Note that there is an `if (likely((((j.outer*32) + threadIdx.x) < 1024)))` check in the IR guarding the boundary of `a.shared`. Because of the `LoopPartition` pass, this check splits the surrounding loop into two parts in the CUDA code.
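For readers unfamiliar with the pass, the transformation it applies can be sketched in plain Python. This is a minimal illustration of the general idea (not TVM's actual implementation): a loop whose body is guarded by a `likely` condition is split into a prefix where the condition provably holds (guard removed) and a "doubt" tail where the guard is kept.

```python
import math

def partitioned_sum(a, n, factor=32):
    """Sum a[0:n] with the outer loop split the way LoopPartition would."""
    n_outer = math.ceil(n / factor)
    # Prefix where i*factor + t < len(a) provably holds for every t < factor:
    split = min(n_outer, len(a) // factor)
    total = 0.0
    for i in range(split):             # buffer-bound guard removed here
        for t in range(factor):
            if i * factor + t < n:     # reduction guard stays
                total += a[i * factor + t]
    for i in range(split, n_outer):    # "doubt" tail keeps both guards
        for t in range(factor):
            idx = i * factor + t
            if idx < len(a) and idx < n:
                total += a[idx]
    return total
```

The point of the transformation is that the hot prefix loop runs without the buffer-bound branch; only the (usually empty or short) tail pays for it.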
Here is the point. In the CUDA code, the first loop has extent 32, and the check has been removed from its body. However, **`((j.outer*32) + threadIdx.x) < 1024` is not always true, because at `j.outer == 32` the index becomes `32 * 32 + threadIdx.x == 1024 + threadIdx.x`; the loop bound should be lowered to account for `threadIdx.x`.** Either the loop should be shorter, or the check cannot be removed.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [incubator-tvm] roastduck commented on issue #5303: [TIR] Buggy LoopPartition
Posted by GitBox <gi...@apache.org>.
URL: https://github.com/apache/incubator-tvm/issues/5303#issuecomment-612059701
It's my misunderstanding. `j.outer < 32` (not `<=`), so `((j.outer*32) + threadIdx.x) < 1024` still holds.
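The retraction is easy to verify arithmetically; the constants below are the ones from the example:

```python
import math

# With n <= n_max = 1024 and factor = 32, the loop runs j_outer < ceil(n/32) <= 32,
# i.e. j_outer <= 31, so the largest index loaded into a_shared is
# 31*32 + 31 = 1023 < 1024. The removed guard indeed always held.
n_max, factor, threads = 1024, 32, 32
max_j_outer = math.ceil(n_max / factor) - 1
worst_index = max_j_outer * factor + (threads - 1)
assert worst_index == 1023 and worst_index < n_max
```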
[GitHub] [incubator-tvm] roastduck closed issue #5303: [TIR] Buggy LoopPartition
Posted by GitBox <gi...@apache.org>.
URL: https://github.com/apache/incubator-tvm/issues/5303