Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/07/22 06:56:13 UTC

[GitHub] [incubator-tvm] lsy643 opened a new pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108


   In this PR, the CUDA compute functions of `get_valid_counts` and `nms` are changed to make them work as expected.
   
   1. For `get_valid_counts`, only one thread is used per image. I am not sure whether this is a good approach.
   2. For `nms`, there are two changes (a sketch follows this list):
     2.1 make `box_indices` map back to the original data indices
     2.2 create `rearrange_indices_out` for `nms` when `return_indices == True`
   3. Test cases for the gpu versions of `get_valid_counts` and `nms` are now enabled.
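   
   For reference, a minimal NumPy sketch of what point 2.2 is meant to compute; the helper name and the exact padding convention are my own illustration, not the PR's code:
   ```python
   import numpy as np

   def rearrange_indices_out_ref(indices):
       # indices: int32 [batch, num_anchors]; kept boxes hold their original
       # data index, suppressed slots hold -1 (the convention described above).
       output = np.full_like(indices, -1)
       valid_box_count = np.zeros((indices.shape[0], 1), dtype="int32")
       for b in range(indices.shape[0]):
           valid = indices[b][indices[b] >= 0]
           output[b, :len(valid)] = valid
           valid_box_count[b, 0] = len(valid)
       return output, valid_box_count
   ```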





[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-690903238


   @yongwww 
   I have added a test case for the nms cuda version in `test_op_level5.py`, with test data assumed to come from a `get_valid_count`.
   
   Since there is no `rearrange_indices_out` for the nms cuda version, I only compare it with the llvm version.
   1. For test data with shape `[1, 5, 6]`
   - cuda time: 90us
   - llvm time: 32us
   
   2. For test data with shape `[1, 20000, 6]`
   - cuda time: 6230us
   - llvm time: 219209us
   
   The inference time for llvm on the large dataset is far too long.
   
   Test data I use:
   ```python
   import numpy as np

   data_length = 20000
   np_valid_count = np.array([data_length]).astype("int32")
   np_indices = np.array([list(range(data_length))]).astype("int32")

   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], [1, 0.7, 30, 60, 50, 80],
                        [0, 0.4, 4, 21, 19, 40], [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   # repeat each of the 5 anchors data_length // 5 times along axis 1
   np_data = np_data.repeat(data_length // 5, axis=1)
   ```
   





[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-680485005


   @yongwww @Laurawly 
   Sorry for the late response; I have been quite busy at work recently. I will try to fix the error and run the benchmark by the end of this week.





[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-683385549


   @yongwww @Laurawly 
   I am quite confused about the test data used in `test_non_max_suppression` from `tests/python/relay/test_op_level5.py`.
   
   If I understand correctly, a `get_valid_count`, a `non_max_suppression` and a `strided_slice` are used together as a `non_max_suppression` operator, according to `def _nms()` in `frontend/tensorflow.py`. The `get_valid_counts` of the `cuda` version does not move valid boxes to the top of the input data, while the `get_valid_counts` of the `cpu` version does.
   
   Therefore, it seems to make more sense to use different test data for `test_non_max_suppression`.
   For example:
   ```python
   # the original data, before get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 30, 60, 50, 80],
                        [0, 0.4, 4, 21, 19, 40], 
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   ```
   
   ```python
   # cpu test data, after get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 1, 20, 25, 45],
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110], 
                        [-1, -1, -1, -1, -1, -1]]]).astype("float32")
   ```
   
   ```python
   # cuda test data, after get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 2, 21, 26, 45],
                        [-1, -1, -1, -1, -1, -1], 
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   
   ```
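   
   For clarity, a minimal NumPy sketch of the two behaviors (the threshold and index conventions here are assumptions for illustration, not the exact TOPI signatures):
   ```python
   import numpy as np

   def valid_mask(data, score_threshold, score_index=1, id_index=0):
       # A box is valid if its score exceeds the threshold and its class id is >= 0.
       return (data[:, :, score_index] > score_threshold) & \
              (data[:, :, id_index] >= 0)

   def cpu_style(data, mask):
       # cpu version: compact the valid boxes to the top, pad the tail with -1.
       out = np.full_like(data, -1.0)
       for b in range(data.shape[0]):
           kept = data[b][mask[b]]
           out[b, :len(kept)] = kept
       return out

   def cuda_style(data, mask):
       # cuda version: keep valid boxes in place, overwrite invalid slots with -1.
       return np.where(mask[:, :, None], data, -1.0)
   ```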
   
   





[GitHub] [incubator-tvm] trevor-m commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r459138745



##########
File path: tests/python/relay/test_op_level5.py
##########
@@ -270,9 +270,9 @@ def verify_get_valid_counts(dshape, score_threshold, id_index, score_index):
             intrp = relay.create_executor("debug", ctx=ctx, target=target)
             out = intrp.evaluate(func)(np_data)
             tvm.testing.assert_allclose(out[0].asnumpy(), np_out1, rtol=1e-3, atol=1e-04)
-            # get_valid_count for cuda, opencl doesn't do data rearrangement
-            if target in ['cuda', 'opencl']:
-                return
+            # get_valid_count for opencl doesn't do data rearrangement
+            if target in ['opencl']:

Review comment:
       OpenCL shares the cuda implementation, so you can enable this test too. The CI doesn't run opencl, so please test it manually.







[GitHub] [incubator-tvm] lsy643 commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r459178559



##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -93,44 +93,41 @@ def get_valid_counts_ir(data, valid_count, out, out_indices,
     valid_count = ib.buffer_ptr(valid_count)
     out = ib.buffer_ptr(out)
     out_indices = ib.buffer_ptr(out_indices)
-    atomic_add_return = ib.allocate(
-        valid_count.dtype, (1,), name='atomic_add_return', scope='local')
     one_count = tvm.tir.const(1, dtype=valid_count.dtype)
     one = tvm.tir.const(1, dtype=out.dtype)
     score_threshold = tvm.ir.make_node(
         "FloatImm", dtype="float32", value=score_threshold)
     id_index = tvm.ir.make_node("IntImm", dtype="int32", value=id_index)
     score_index = tvm.ir.make_node("IntImm", dtype="int32", value=score_index)
 
-    max_threads = int(tvm.target.Target.current(
-        allow_none=False).max_num_threads)
-    nthread_tx = max_threads
-    nthread_bx = batch_size * num_anchors // max_threads + 1
+    nthread_tx = batch_size
+    nthread_bx = 1
     tx = te.thread_axis("threadIdx.x")
     bx = te.thread_axis("blockIdx.x")
     ib.scope_attr(tx, "thread_extent", nthread_tx)
     ib.scope_attr(bx, "thread_extent", nthread_bx)
-    tid = bx * max_threads + tx
-    idxd = tvm.tir.indexdiv
-
-    # initialize valid_count
-    with ib.if_scope(tid < batch_size):
-        valid_count[tid] = 0
-    with ib.if_scope(tid < batch_size * num_anchors):
-        i = idxd(tid, num_anchors)
+    tid = tx
+
+    # each thread process one batch
+    valid_count[tid] = 0
+    data_base_ind = tid * num_anchors * elem_length
+    ind_base_ind = tid * num_anchors
+    with ib.for_range(0, num_anchors) as anchor_ind:
+        with ib.for_range(0, elem_length) as k:
+            out[data_base_ind + anchor_ind * elem_length + k] = -one
+        out_indices[ind_base_ind + anchor_ind] = -one_count
+
+    with ib.for_range(0, num_anchors) as anchor_ind:
         with ib.if_scope(
-                tvm.tir.all(data[tid * elem_length + score_index] > score_threshold,
-                            tvm.tir.any(id_index < 0, data[tid * elem_length + id_index] >= 0))):
-            atomic_add_return[0] = atomic_add(tvm.tir.call_intrin("handle", "tir.address_of",

Review comment:
       Unused `atomic_add` definitions have been removed.







[GitHub] [incubator-tvm] Laurawly commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r460479420



##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -184,7 +155,84 @@ def get_valid_counts(data, score_threshold=0, id_index=0, score_index=1):
     return [valid_count, out, out_indices]
 
 
-def nms_ir(data, sorted_index, valid_count, out, box_indices,
+def rearrange_indices_out_ir(data, output, valid_box_count):
+    """Low level IR to get rearrange_indices_out.
+    Parameters
+    ----------
+    data : Buffer
+        Input data. 2-D Buffer with shape [batch_size, num_anchors].
+
+    output: Buffer
+        2-D Buffer with shape [batch_size, num_anchors].
+
+    valid_box_count : Buffer
+        2-D Buffer with shape [batch_size, 1].
+
+    Returns
+    -------
+    stmt : Stmt
+        The result IR statement.
+    """
+    batch_size = data.shape[0]
+    num_anchors = data.shape[1]
+    ib = tvm.tir.ir_builder.create()
+
+    data = ib.buffer_ptr(data)
+    output = ib.buffer_ptr(output)
+    valid_box_count = ib.buffer_ptr(valid_box_count)
+
+    nthread_tx = batch_size
+    nthread_bx = 1
+    tx = te.thread_axis("threadIdx.x")
+    bx = te.thread_axis("blockIdx.x")
+    ib.scope_attr(tx, "thread_extent", nthread_tx)
+    ib.scope_attr(bx, "thread_extent", nthread_bx)
+    tid = tx
+
+    valid_box_count[tid] = 0
+    with ib.for_range(0, num_anchors) as anchor_ind:
+        output[tid * num_anchors + anchor_ind] = data[tid * num_anchors + anchor_ind]
+    return ib.get()
+
+
+def rearrange_indices_out(data):

Review comment:
       This will regress performance a lot; I don't recommend it.

##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -93,44 +67,41 @@ def get_valid_counts_ir(data, valid_count, out, out_indices,
     valid_count = ib.buffer_ptr(valid_count)
     out = ib.buffer_ptr(out)
     out_indices = ib.buffer_ptr(out_indices)
-    atomic_add_return = ib.allocate(
-        valid_count.dtype, (1,), name='atomic_add_return', scope='local')
     one_count = tvm.tir.const(1, dtype=valid_count.dtype)
     one = tvm.tir.const(1, dtype=out.dtype)
     score_threshold = tvm.ir.make_node(
         "FloatImm", dtype="float32", value=score_threshold)
     id_index = tvm.ir.make_node("IntImm", dtype="int32", value=id_index)
     score_index = tvm.ir.make_node("IntImm", dtype="int32", value=score_index)
 
-    max_threads = int(tvm.target.Target.current(
-        allow_none=False).max_num_threads)
-    nthread_tx = max_threads
-    nthread_bx = batch_size * num_anchors // max_threads + 1

Review comment:
       Using only one thread will regress performance a lot. More benchmarks should be shown for the workloads used in previous PRs.
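   
   For reference, a sketch reassembling the removed multi-threaded launch from the hunks quoted in this thread (buffer plumbing simplified; an illustration, not the exact original function):
   ```python
   import tvm
   from tvm import te

   def one_thread_per_element_sketch(data_buf, valid_count_buf):
       # Reassembled from the removed lines above: one CUDA thread per
       # (batch, anchor) element, instead of one thread per batch.
       batch_size = data_buf.shape[0]
       num_anchors = data_buf.shape[1]
       ib = tvm.tir.ir_builder.create()
       valid_count = ib.buffer_ptr(valid_count_buf)

       max_threads = int(tvm.target.Target.current(
           allow_none=False).max_num_threads)
       nthread_tx = max_threads
       nthread_bx = batch_size * num_anchors // max_threads + 1
       tx = te.thread_axis("threadIdx.x")
       bx = te.thread_axis("blockIdx.x")
       ib.scope_attr(tx, "thread_extent", nthread_tx)
       ib.scope_attr(bx, "thread_extent", nthread_bx)
       tid = bx * max_threads + tx

       # initialize valid_count; the removed filtering body guarded the same
       # tid against batch_size * num_anchors before processing one element.
       with ib.if_scope(tid < batch_size):
           valid_count[tid] = 0
       return ib.get()
   ```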







[GitHub] [incubator-tvm] lsy643 commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r459178638



##########
File path: tests/python/relay/test_op_level5.py
##########
@@ -270,9 +270,9 @@ def verify_get_valid_counts(dshape, score_threshold, id_index, score_index):
             intrp = relay.create_executor("debug", ctx=ctx, target=target)
             out = intrp.evaluate(func)(np_data)
             tvm.testing.assert_allclose(out[0].asnumpy(), np_out1, rtol=1e-3, atol=1e-04)
-            # get_valid_count for cuda, opencl doesn't do data rearrangement
-            if target in ['cuda', 'opencl']:
-                return
+            # get_valid_count for opencl doesn't do data rearrangement
+            if target in ['opencl']:

Review comment:
       Test for `opencl` has been enabled.










[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-692416856


   @yongwww The `rearrange_indices_out` is a new part added to `non_max_suppression` in `topi/cuda/nms.py`, so I compare the latency of `non_max_suppression` with `return_indices=True` and with `return_indices=False`.
   
   For test data with shape [1, 20000, 6]:
   
   When `return_indices=True`, `rearrange_indices_out` is used
   - latency: 6342us
   
   When `return_indices=False`, `rearrange_indices_out` is not used
   - latency: 5645us





[GitHub] [incubator-tvm] lsy643 edited a comment on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-690903238


   @yongwww 
   I have added a test case for the nms cuda version in `test_op_level5.py`, with test data assumed to come from a `get_valid_count`.
   
   Since there is no `rearrange_indices_out` for the nms cuda version, I only compare it with the llvm version.
   1. For test data with shape `[1, 5, 6]`
   - cuda time: 90us
   - llvm time: 32us
   
   2. For test data with shape `[1, 20000, 6]`
   - cuda time: 6230us
   - llvm time: 219209us
   
   The inference time for llvm on the large dataset is far too long.
   
   Test data I use:
   ```python
   import numpy as np

   data_length = 20000
   np_valid_count = np.array([data_length]).astype("int32")
   np_indices = np.array([list(range(data_length))]).astype("int32")

   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], [1, 0.7, 30, 60, 50, 80],
                        [0, 0.4, 4, 21, 19, 40], [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   # repeat each of the 5 anchors data_length // 5 times along axis 1
   np_data = np_data.repeat(data_length // 5, axis=1)
   ```
   
   The compute and schedule functions I use:
   ```python
   import topi

   use_cuda = False
   if use_cuda:
       device = 'cuda'
       fcompute = topi.cuda.non_max_suppression
       fschedule = topi.cuda.schedule_nms
   else:
       device = 'llvm'
       fcompute = topi.vision.non_max_suppression
       fschedule = topi.generic.schedule_nms
   ```
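   
   For completeness, a hedged sketch of the timing harness (my own illustration of how such numbers are usually collected; `f` would come from `tvm.build` on the selected compute/schedule pair):
   ```python
   import tvm

   def measure_us(f, ctx, np_arrays, number=10, repeat=3):
       # f: runtime function built with tvm.build(s, args, device)
       # ctx: tvm.context(device, 0); np_arrays: input/output numpy buffers
       tvm_args = [tvm.nd.array(a, ctx) for a in np_arrays]
       timer = f.time_evaluator(f.entry_name, ctx, number=number, repeat=repeat)
       return timer(*tvm_args).mean * 1e6  # mean latency in microseconds
   ```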
   





[GitHub] [incubator-tvm] trevor-m commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-663152986


   IIRC, data rearrangement was removed from `get_valid_counts` to improve performance, because the rearrangement would be done by NMS anyway. Does this PR maintain that performance? @Laurawly 





[GitHub] [incubator-tvm] Laurawly commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-663936994


   I'm wondering what the purpose of this PR is. Currently, there's no correctness issue with `get_valid_counts` and `nms` in the object detection related models that use them end-to-end. And based on the benchmarks shown in this [PR](https://github.com/apache/incubator-tvm/pull/5339), there's a lot of performance gain from the recent changes. And as that PR summarized, we know data rearrangement on the GPU is super slow.





[GitHub] [incubator-tvm] Laurawly commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-674562133


   > @trevor-m @yongwww @Laurawly
   > For the `get_valid_counts` part, I misunderstood it because I didn't understand `argsort` correctly, and I have reverted to the original version, which is much faster.
   > 
   > For the `rearrange_indices_out` part, which is necessary because the result of `nms` is used by a `strided_slice` in `def _nms` in the tensorflow frontend, I agree that the current way may regress the performance, but since we need to do data rearrangement in this function, I can hardly figure out a better way to implement it.
   
   Could you show some benchmark numbers regarding the changes? @yongwww could have a better comment on the tensorflow related changes. Also, it seems that there's an illegal memory access error in CI.





[GitHub] [incubator-tvm] lsy643 edited a comment on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-683385549


   @yongwww @Laurawly 
   I am quite confused about the test data used in `test_non_max_suppression` from `tests/python/relay/test_op_level5.py`.
   
   If I understand correctly, a `get_valid_count`, a `non_max_suppression` and a `strided_slice` are used together as a `non_max_suppression` operator, according to `def _nms()` in `frontend/tensorflow.py`. The `get_valid_counts` of the `cuda` version does not move valid boxes to the top of the input data, while the `get_valid_counts` of the `cpu` version does.
   
   Therefore, it seems to make more sense to use different test data for `test_non_max_suppression`.
   For example:
   ```python
   # the original data, before get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 30, 60, 50, 80],
                        [0, 0.4, 4, 21, 19, 40], 
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   ```
   
   ```python
   # cpu test data, after get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 1, 20, 25, 45],
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110], 
                        [-1, -1, -1, -1, -1, -1]]]).astype("float32")
   
   np_indices = np.array([[0, 1, 3, 4, -1]]).astype("int32")
   ```
   
   ```python
   # cuda test data, after get_valid_counts
   np_data = np.array([[[0, 0.8, 1, 20, 25, 45], 
                        [1, 0.7, 2, 21, 26, 45],
                        [-1, -1, -1, -1, -1, -1], 
                        [2, 0.9, 35, 61, 52, 79],
                        [1, 0.5, 100, 60, 70, 110]]]).astype("float32")
   
   np_indices = np.array([[0, 1, -1, 3, 4]]).astype("int32")
   ```
   
   





[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-664079578


   @trevor-m @yongwww @Laurawly 
   For the `get_valid_counts` part, I misunderstood it because I didn't understand `argsort` correctly, and I have reverted to the original version, which is much faster. 
   
   For the `rearrange_indices_out` part, which is necessary because the result of `nms` is used by a `strided_slice` in `def _nms` in the tensorflow frontend, I agree that the current way may regress the performance, but since we need to do data rearrangement in this function, I can hardly figure out a better way to implement it. 





[GitHub] [incubator-tvm] yongwww commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r460383539



##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -93,44 +67,41 @@ def get_valid_counts_ir(data, valid_count, out, out_indices,
     valid_count = ib.buffer_ptr(valid_count)
     out = ib.buffer_ptr(out)
     out_indices = ib.buffer_ptr(out_indices)
-    atomic_add_return = ib.allocate(
-        valid_count.dtype, (1,), name='atomic_add_return', scope='local')
     one_count = tvm.tir.const(1, dtype=valid_count.dtype)
     one = tvm.tir.const(1, dtype=out.dtype)
     score_threshold = tvm.ir.make_node(
         "FloatImm", dtype="float32", value=score_threshold)
     id_index = tvm.ir.make_node("IntImm", dtype="int32", value=id_index)
     score_index = tvm.ir.make_node("IntImm", dtype="int32", value=score_index)
 
-    max_threads = int(tvm.target.Target.current(
-        allow_none=False).max_num_threads)
-    nthread_tx = max_threads
-    nthread_bx = batch_size * num_anchors // max_threads + 1

Review comment:
       Would like to know more about the reason behind this change; perhaps you could share some benchmark numbers? In some scenarios, like TF MaskRCNN, a large number of boxes (`num_anchors` > 20000) are in one batch, and multiple threads here might provide a performance improvement.







[GitHub] [incubator-tvm] lsy643 commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-662279185


   cc @yongwww  @Laurawly











[GitHub] [incubator-tvm] yongwww commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-692398165


   @lsy643 thanks for sharing the results. What I am wondering about is the latency of your change vs the previous nms gpu version (even if the output is not identical), and probably the perf numbers of your change vs a TensorFlow baseline. As Leyuan mentioned above, the thread related change might cause a performance regression; performance matters a lot for us, so we would like to see some perf numbers for this. If a performance regression does exist, then it should be fixed.





[GitHub] [incubator-tvm] lsy643 commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r460478875



##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -93,44 +67,41 @@ def get_valid_counts_ir(data, valid_count, out, out_indices,
     valid_count = ib.buffer_ptr(valid_count)
     out = ib.buffer_ptr(out)
     out_indices = ib.buffer_ptr(out_indices)
-    atomic_add_return = ib.allocate(
-        valid_count.dtype, (1,), name='atomic_add_return', scope='local')
     one_count = tvm.tir.const(1, dtype=valid_count.dtype)
     one = tvm.tir.const(1, dtype=out.dtype)
     score_threshold = tvm.ir.make_node(
         "FloatImm", dtype="float32", value=score_threshold)
     id_index = tvm.ir.make_node("IntImm", dtype="int32", value=id_index)
     score_index = tvm.ir.make_node("IntImm", dtype="int32", value=score_index)
 
-    max_threads = int(tvm.target.Target.current(
-        allow_none=False).max_num_threads)
-    nthread_tx = max_threads
-    nthread_bx = batch_size * num_anchors // max_threads + 1

Review comment:
       It seems that the gpu `get_valid_counts` does not need to move valid boxes to the top of the input data because of the `argsort` in `nms`, so using the original `get_valid_counts_ir` ought to be better. And do you think `rearrange_indices_out_ir` is the right way to do it?







[GitHub] [incubator-tvm] yongwww commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-687288778


   @lsy643 you are right, the auxiliary ops `get_valid_count` and `strided_slice` are utilized to help handle TensorFlow's dynamic NonMaximumSuppression. As a todo task, the cpu and gpu versions of the op are expected to behave consistently. 
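   
   For readers following the thread, a simplified sketch of that composition, mirroring the spirit of `def _nms()` in `frontend/tensorflow.py` (the real attribute handling is more involved, so treat the argument lists as assumptions):
   ```python
   from tvm import relay

   def tf_style_nms(inputs, max_output_size, iou_threshold, score_threshold):
       # 1. Count the boxes that pass the score threshold (dynamic valid count).
       ct, data, indices = relay.vision.get_valid_counts(
           inputs, score_threshold=score_threshold)
       # 2. Run NMS, asking for indices that map back into the original data.
       nms_out = relay.vision.non_max_suppression(
           data, ct, indices, max_output_size=max_output_size,
           iou_threshold=iou_threshold, return_indices=True)
       # 3. Slice the leading entries to emulate TF's dynamically sized output.
       return relay.strided_slice(nms_out, begin=[0, 0],
                                  end=[1, max_output_size])
   ```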





[GitHub] [incubator-tvm] trevor-m commented on a change in pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#discussion_r459138745



##########
File path: tests/python/relay/test_op_level5.py
##########
@@ -270,9 +270,9 @@ def verify_get_valid_counts(dshape, score_threshold, id_index, score_index):
             intrp = relay.create_executor("debug", ctx=ctx, target=target)
             out = intrp.evaluate(func)(np_data)
             tvm.testing.assert_allclose(out[0].asnumpy(), np_out1, rtol=1e-3, atol=1e-04)
-            # get_valid_count for cuda, opencl doesn't do data rearrangement
-            if target in ['cuda', 'opencl']:
-                return
+            # get_valid_count for opencl doesn't do data rearrangement
+            if target in ['opencl']:

Review comment:
       OpenCL shares the cuda implementation, so you can enable this test too.

##########
File path: topi/python/topi/cuda/nms.py
##########
@@ -93,44 +93,41 @@ def get_valid_counts_ir(data, valid_count, out, out_indices,
     valid_count = ib.buffer_ptr(valid_count)
     out = ib.buffer_ptr(out)
     out_indices = ib.buffer_ptr(out_indices)
-    atomic_add_return = ib.allocate(
-        valid_count.dtype, (1,), name='atomic_add_return', scope='local')
     one_count = tvm.tir.const(1, dtype=valid_count.dtype)
     one = tvm.tir.const(1, dtype=out.dtype)
     score_threshold = tvm.ir.make_node(
         "FloatImm", dtype="float32", value=score_threshold)
     id_index = tvm.ir.make_node("IntImm", dtype="int32", value=id_index)
     score_index = tvm.ir.make_node("IntImm", dtype="int32", value=score_index)
 
-    max_threads = int(tvm.target.Target.current(
-        allow_none=False).max_num_threads)
-    nthread_tx = max_threads
-    nthread_bx = batch_size * num_anchors // max_threads + 1
+    nthread_tx = batch_size
+    nthread_bx = 1
     tx = te.thread_axis("threadIdx.x")
     bx = te.thread_axis("blockIdx.x")
     ib.scope_attr(tx, "thread_extent", nthread_tx)
     ib.scope_attr(bx, "thread_extent", nthread_bx)
-    tid = bx * max_threads + tx
-    idxd = tvm.tir.indexdiv
-
-    # initialize valid_count
-    with ib.if_scope(tid < batch_size):
-        valid_count[tid] = 0
-    with ib.if_scope(tid < batch_size * num_anchors):
-        i = idxd(tid, num_anchors)
+    tid = tx
+
+    # each thread process one batch
+    valid_count[tid] = 0
+    data_base_ind = tid * num_anchors * elem_length
+    ind_base_ind = tid * num_anchors
+    with ib.for_range(0, num_anchors) as anchor_ind:
+        with ib.for_range(0, elem_length) as k:
+            out[data_base_ind + anchor_ind * elem_length + k] = -one
+        out_indices[ind_base_ind + anchor_ind] = -one_count
+
+    with ib.for_range(0, num_anchors) as anchor_ind:
         with ib.if_scope(
-                tvm.tir.all(data[tid * elem_length + score_index] > score_threshold,
-                            tvm.tir.any(id_index < 0, data[tid * elem_length + id_index] >= 0))):
-            atomic_add_return[0] = atomic_add(tvm.tir.call_intrin("handle", "tir.address_of",

Review comment:
       Since we are no longer using atomic_add, should we remove those intrinsic definitions?







[GitHub] [incubator-tvm] yongwww commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-663828876


   @lsy643 Regarding the thread change, could you please benchmark the performance before and after your change and share the numbers? 











[GitHub] [incubator-tvm] yongwww commented on pull request #6108: Fix CUDA Compute Function For `get_valid_counts` and `nms`

URL: https://github.com/apache/incubator-tvm/pull/6108#issuecomment-678908663


   @lsy643 the `rearrange_indices_out` part you updated looks good to me. Currently I am concerned about the thread related change, since it might cause some performance regression, especially for scenarios with >20k input boxes. It would be great to show some benchmark numbers for the change.




