Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/04/10 10:38:47 UTC

[GitHub] [incubator-tvm] roastduck opened a new issue #5303: [TIR] Buggy LoopPartition

URL: https://github.com/apache/incubator-tvm/issues/5303
 
 
   The `LoopPartition` pass may generate wrong code when there are variables unknown at compile time.
   
   Here is an example:
   
   ```python
   import tvm
   import topi
   import numpy as np
   
   dtype = "float32"
   target = "cuda"
   m = 32
   n_max = 1024
   
   n = tvm.te.placeholder((), name="n", dtype="int32")  # 0-d tensor: reduction length known only at runtime
   a = tvm.te.placeholder((n_max,), name="a", dtype=dtype)
   def f(i):
       j = tvm.te.reduce_axis((0, n), name="j")
       return tvm.te.sum(i * a[j], axis=[j])
   b = tvm.te.compute((m,), f)
   
   with tvm.target.cuda():
       s = tvm.te.create_schedule(b.op)
       blk_x = tvm.te.thread_axis("blockIdx.x")
       th_x = tvm.te.thread_axis("threadIdx.x")
   
       i, = b.op.axis
       j, = b.op.reduce_axis
       i_outer, i_inner = s[b].split(i, nparts=1)  # a single block
       j_outer, j_inner = s[b].split(j, factor=m)  # tile the reduction by m = 32
       s[b].reorder(i_outer, j_outer, i_inner, j_inner)
       s[b].bind(i_outer, blk_x)
       s[b].bind(i_inner, th_x)
   
       a_cache = s.cache_read(a, "shared", [b])  # stage each tile of `a` in shared memory
       s[a_cache].compute_at(s[b], j_outer)
       a_axis, = a_cache.op.axis
       a_axis, _ = s[a_cache].split(a_axis, factor=1) # Workaround TVM Discuss Question 4826
       s[a_cache].bind(a_axis, th_x)
   
   print(tvm.lower(s, [n, a, b], target, simple_mode=True))
   compute = tvm.build(s, [n, a, b], target, name="run")
   print(compute.imported_modules[0].get_source())
   ```
   
   In this example, we tile `a` with a factor of 32 and cache the tiles in shared memory. The key characteristic of this example is that the reduction length `n` is unknown at compile time, so TVM has to emit boundary checks.
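
   For concreteness, here is a plain-Python sketch of the tile-count arithmetic (the helper `num_tiles` is illustrative, not part of the repro):
   
   ```python
   m = 32        # reduction tile size, as in the schedule above
   n_max = 1024  # static extent of `a`
   
   def num_tiles(n):
       # Matches floordiv((n[0] + 31), 32) in the lowered IR below.
       return (n + m - 1) // m
   
   print(num_tiles(1000))  # 32 tiles; the last one is partial, hence the boundary checks
   print(num_tiles(1024))  # 32 full tiles, no partial tail
   ```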
   
   We first print the lowered IR and then the generated CUDA code. The output is as follows:
   
   ```
   produce compute {
     // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
     // attr [a.shared] storage_scope = "shared"
     allocate a.shared[float32 * 32]
     // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
     compute[threadIdx.x] = 0f
     for (j.outer, 0, floordiv((n[0] + 31), 32)) {
       produce a.shared {
         // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
         if (likely((((j.outer*32) + threadIdx.x) < 1024))) { // LOOK AT HERE
           a.shared[threadIdx.x] = a[((j.outer*32) + threadIdx.x)]
         }
       }
       // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
       for (j.inner, 0, 32) {
         if (likely((((j.outer*32) + j.inner) < n[0]))) {
           compute[threadIdx.x] = (compute[threadIdx.x] + (float32(threadIdx.x)*a.shared[j.inner]))
         }
       }
     }
   }
   
   [18:24:47] /home/rd/src/incubator-tvm/src/tir/pass/loop_partition.cc:533: Cannot prove:((((floordiv((n[0] + 31), 32) - 1) - 32) + 1) >= 0), when generating the post doubt loop
   extern "C" __global__ void run_kernel0(void* __restrict__ compute, void* __restrict__ n, void* __restrict__ a) {
     __shared__ float a_shared[32];
     (( float*)compute)[(((int)threadIdx.x))] = 0.000000e+00f;
     for (int j_outer = 0; j_outer < min(32, (((( int*)n)[(0)] + 31) >> 5)); ++j_outer) { // LOOK AT HERE
       __syncthreads();
       a_shared[(((int)threadIdx.x))] = (( float*)a)[(((j_outer * 32) + ((int)threadIdx.x)))]; // LOOK AT HERE
       __syncthreads();
       for (int j_inner = 0; j_inner < 32; ++j_inner) {
         if (((j_outer * 32) + j_inner) < (( int*)n)[(0)]) {
           (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner)]));
         }
       }
     }
     for (int j_outer1 = 0; j_outer1 < max(((((( int*)n)[(0)] + 31) >> 5) - 32), 0); ++j_outer1) {
       __syncthreads();
       if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)) < 1024) {
         a_shared[(((int)threadIdx.x))] = (( float*)a)[((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)))];
       }
       __syncthreads();
       for (int j_inner1 = 0; j_inner1 < 32; ++j_inner1) {
         if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + j_inner1) < (( int*)n)[(0)]) {
           (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner1)]));
         }
       }
     }
   }
   ```
   
   Note that there is an `if (likely((((j.outer*32) + threadIdx.x) < 1024)))` check in the IR that guards the boundary of `a.shared`. Because of the `LoopPartition` pass, this check splits the surrounding loop into two parts in the CUDA code: a main loop in which the condition is assumed to always hold and the guard is removed, and a "post doubt" tail loop that keeps the check (see the warning in the log above).
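
   To see the partitioning in isolation, here is a minimal sketch under the same TVM API as the repro above (this standalone compute is illustrative, not taken from the issue):
   
   ```python
   import tvm
   
   # A 1-D compute whose extent n is symbolic, so split() emits a likely()
   # guard for the possibly-partial tail iterations.
   n = tvm.te.var("n")
   A = tvm.te.placeholder((n,), name="A", dtype="float32")
   B = tvm.te.compute((n,), lambda i: A[i] + 1.0, name="B")
   s = tvm.te.create_schedule(B.op)
   xo, xi = s[B].split(B.op.axis[0], factor=32)
   
   # Before LoopPartition the loop body is guarded by
   #   if (likely(((i.outer*32) + i.inner) < n)) { ... }
   # LoopPartition (run during lowering) tries to split i.outer into a
   # full-tile main loop with the guard removed and a "post doubt" tail
   # loop that keeps the check -- the same transformation seen above.
   print(tvm.lower(s, [A, B], simple_mode=True))
   ```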
   
   Here is the point. In the CUDA code, the first loop has length 32, and the check has been removed from its body. However, **`((j.outer*32) + threadIdx.x) < 1024` is not always true, because all we can bound is `(j.outer*32) + threadIdx.x < 32 * 32 + threadIdx.x == 1024 + threadIdx.x`. The upper bound of `threadIdx.x`, which is 32, should be taken into account when deducing the bound.** Either the loop should be shorter, or the check cannot be removed.

----------------------------------------------------------------

[GitHub] [incubator-tvm] roastduck commented on issue #5303: [TIR] Buggy LoopPartition

Posted by GitBox <gi...@apache.org>.
URL: https://github.com/apache/incubator-tvm/issues/5303#issuecomment-612059701
 
 
   It was my misunderstanding: `j.outer < 32` (not `<=`), so `j.outer <= 31` and `threadIdx.x <= 31`, which gives `(j.outer*32) + threadIdx.x <= 31*32 + 31 == 1023`. Hence `((j.outer*32) + threadIdx.x) < 1024` still holds and the check can safely be removed.
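
   A quick plain-Python check of that bound (no TVM needed):
   
   ```python
   # j_outer and threadIdx.x both range over [0, 32) in the first partitioned
   # loop, so the largest index into `a` is 31*32 + 31 == 1023 < 1024.
   assert all(j_outer * 32 + tid < 1024
              for j_outer in range(32)
              for tid in range(32))
   ```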

----------------------------------------------------------------


[GitHub] [incubator-tvm] roastduck closed issue #5303: [TIR] Buggy LoopPartition

Posted by GitBox <gi...@apache.org>.
URL: https://github.com/apache/incubator-tvm/issues/5303
 
 
   
