You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/04/19 05:12:31 UTC
[GitHub] [incubator-tvm] yongfeng-nv commented on issue #5367: WIP: Improve floormod (Don't merge)

yongfeng-nv commented on issue #5367: WIP: Improve floormod (Don't merge)
URL: https://github.com/apache/incubator-tvm/pull/5367#issuecomment-616035227
 
 
   The current behavior of IntervalSet floormod(a, b) is rough -- it returns [0, b-1], [-b+1, b-1], or everything.  It causes extra iterations to my schedule.  This is a simplified version:
   
       import tvm
       from tvm import te
       
       def test_bound_tile_mod():
           def compute(M_tiles, N_tiles, factor, dtype):
               # Algo
               M = M_tiles * factor
               N = N_tiles * factor
       
               A = tvm.te.placeholder((N, M), name='A', dtype=dtype)
               C = tvm.te.compute((N, M), lambda n, m: A[n, m], name='C')
               s = tvm.te.create_schedule(C.op)
       
               return s, A, C
       
           def schedule(s, factor, padding, A, C):
               C_local = s.cache_write(C, "local")
       
               n, m = C.op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               nio, nii = s[C].split(ni, 2)
               n = s[C].fuse(nii, mi)
               C_shared = s.cache_write(C, "shared")
               bn, bm, ni, mi = C_shared.op.axis       
               s[C_shared].storage_align(ni, factor * 2, padding)
       
               n, m = s[C].op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               s[C].set_scope("global")
               niio, niii = s[C].split(ni, 32)
               s[C_shared].compute_at(s[C], niio)
       
               return s
       
           s, A, C = compute(2, 2, 128, "float16")
           s = schedule(s, 128, 8, A, C)
           bounds = tvm.te.schedule.InferBound(s)
           print(tvm.lower(s, [A, C], simple_mode=True))
   
   It does 256x256 point-wise copying in 128x128 tiles from local to shared then to global.  I would like to allocate only a 32x128 shared memory and reuse it four times per tile.  In addition, I want to pad 8 data every two tile rows to avoid bank conflicts in shared memory with storage_align.  I expect C.shared (the local ->shared stage) does exactly 256x256 copying, but the following IR shows 4 times of that.
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 16896]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 64) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + n.inner.inner.m.inner.fused.c)] = C.local[(((((n.outer*32768) + (n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + (m.outer*128)) + floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + (n.inner.inner*256)) + (m.outer*128)) + m.inner)] = C.shared[((((n.inner.outer*4224) + (floordiv(n.inner.inner, 2)*264)) + (floormod(n.inner.inner, 2)*128)) + m.inner)]
               }
             }
           }
         }
       }
               
   If there is another schedule to achieve this, please let me know.  As of now, I can only accomplish this by tiling both C.shared and C.  C's compute now involves several floormod: `C.shared(floordiv(n, 128), floordiv(m, 128), floordiv(floormod(n, 128), 2), ((floormod(floormod(n, 128), 2)*128) + floormod(m, 128)))`.  `m` and `n` further bring in leaf IterVars to floormod to PropBoundToInputs() during InferBound.  Because the best result from floormod(x, 128) is [0, 128-1], C.shared doesn't reduce domain, although it does compute_at s[C]'s n.inner.outer with range [0,3].
   
   With this PR, I am able to get expected IR -- `(n.inner.outer.c, 0, 64)` becoming `(n.inner.outer.c, 0, 16)`:
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 4224]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 16) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + n.inner.inner.m.inner.fused.c)] = C.local[((((((n.outer*32768) + (n.inner.outer*8192)) + (n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + (m.outer*128)) + floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + (n.inner.inner*256)) + (m.outer*128)) + m.inner)] = C.shared[(((floordiv(n.inner.inner, 2)*264) + (floormod(n.inner.inner, 2)*128)) + m.inner)]
               }
             }
           }
         }
       }
   
   
   The improvement has two folds:
   1.   EvalSet takes a range map as input to help calculating floormod.  For example:  `[x*4, x*4+3] % 8 = [x*4-x/2*8, x*4+3-(x*4+3)/8*8)]`, if we don't know `x`'s range; but `[x*4, x*4+3]`, if we know `x`'s range is `[0, 1]`.  This is common from tiling.
         
   2.   For `a mod b`, if b is an IntImm, single point, and greater than 0, we do the following:
   
       // a mod b = a - b * (a/b) if
       // (a) a_max - a_min < b, i.e. that before mod, a's range doesn't cover [0, b)
       // and (b) a_min mod b <= a_max mod b, i.e. that a's range is still continuous after mod
   so that `[13, 5] % 10 = [3, 5]`
   
   IntervalSet evaluation is hard.  These improvements are adhoc to the problems I encountered.  I open to any alternatives.
   
   Question: how about mod?  It's implementation looks same as floormod.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services