Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/12/22 23:37:46 UTC

[GitHub] [tvm] masahi opened a new pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

masahi opened a new pull request #7154:
URL: https://github.com/apache/tvm/pull/7154


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] mbrookhart commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548038954



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in pt_scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       What I do in the tests is write a reference version of the graph I want as the output and then assert that the two match; see for instance https://github.com/apache/tvm/blob/e51bcdd1ed99fc813454c68911f8032c852f48b4/tests/python/relay/test_dataflow_pattern.py#L801-L815
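The "write a reference graph and assert equality" testing pattern is framework-agnostic; the following toy rewriter over nested tuples is a sketch of the idea only (it is not the Relay dataflow-pattern API):

```python
# Toy illustration of "expected-output" rewrite testing (not the Relay API):
# expressions are nested tuples, and we rewrite ("add", x, 0) -> x.

def rewrite_add_zero(expr):
    """Recursively simplify additions with a zero operand."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [rewrite_add_zero(a) for a in args]
    if op == "add" and 0 in args:
        args.remove(0)
        return args[0] if len(args) == 1 else (op, *args)
    return (op, *args)

before = ("mul", ("add", "x", 0), "y")
expected = ("mul", "x", "y")  # hand-written reference "graph"
after = rewrite_add_zero(before)
assert after == expected, f"rewrite mismatch: {after}"
```

The same shape carries over to Relay: build the expected module by hand and compare it to the rewritten one with a structural-equality check.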







[GitHub] [tvm] zhiics commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
zhiics commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-749858927


   @masahi I think this is plausible as well, particularly since it is only in the parser. @kevinthesun please help take a look as well. Thanks.








[GitHub] [tvm] mbrookhart edited a comment on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart edited a comment on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750926391


   Sorry for the delay in responding to this, I wanted to look at the frameworks more closely. We currently have 5 importers that leverage NMS:
   
   MXNet does multibox_transform_loc and then NMS on the outputs. multibox_transform_loc converts a 3D array of scores with shape (batch_size, class_num, num_anchors) into the most likely class and score for that class, plus does some coordinate transforms on the boxes.
   
   ONNX takes a 3D tensor of (batch_size, class, num_anchors), does slicing/concatenating with the boxes, and then does a per-class get_valid_counts->non_max.
   
   Pytorch takes in a 1D tensor of scores and concats it with the boxes before performing get_valid_counts and nms. As @masahi shows in this PR, there is preprocessing to embed all classes into that 1D tensor outside of the op.
   
   TF takes a 1D tensor of scores and concats it to the boxes before performing get_valid_counts and nms. I'm not sure if the rest of the TF graph is handling the loop over batch size and classes.
   
   TFLite takes a 3D score tensor of shape (batch_size, num_anchors, class_id), reorders it to (batch_size, class_id, num_anchors), performs multibox_transform_loc->nms, and, strangely, does get_valid_counts after NMS. 
   
   It looks like we're doing pre-processing in every framework to reduce the amount of score information and convert it to the 5- or 6-feature packed form the NMS API wants. None of the frameworks give us inputs in the packed form the API expects, and we jump through hoops in every importer to convert inputs into that form. Then, in at least TFLite and ONNX, we perform further splitting/slicing/concatenating to restore the separate class ids. 
   
   I think I agree with @masahi, we seem to be jumping through a lot of hoops in the importers to support a TVM NMS API that's out of line with the frameworks, and that might be hurting our overall performance.
   
   
   
   
   





[GitHub] [tvm] masahi edited a comment on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-758550210


   @kevinthesun @zhiics @mbrookhart 
   
   As shown in my new NMS PR https://github.com/apache/tvm/pull/7257, this rewrite results in a better speedup with an improved memory layout. Can we merge this? I have new rewrites coming to further optimize PyTorch NMS and MaskRCNN / FasterRCNN.





[GitHub] [tvm] zhiics merged pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
zhiics merged pull request #7154:
URL: https://github.com/apache/tvm/pull/7154


   






[GitHub] [tvm] trevor-m commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
trevor-m commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-758329448


   I highly agree with you guys.
   
   For class-aware NMS, the [batch, num_anchors, 6] format seems very inefficient. It means all anchors need to be checked just to see if the classes match. A [batch, num_classes, num_anchors, 5] format would give us a nicely defined slice of memory where the same-class anchors are located.
   
   > TF takes a 1D tensor of scores and concats it to the boxes before performing get_valid_counts and nms. I'm not sure if the rest of the TF graph is handling the loop over batch size and classes.
   
   That's correct, TF's NMS is only for single class and single batch, so the TF graph loops over batches and classes. To do that, they use [tf.map_fn](https://www.tensorflow.org/api_docs/python/tf/map_fn) so the execution of each NMS can actually still run in parallel. However, this turns into a mess of control flow operators and TensorArrays, so Relay isn't able to do the same parallelization. This PR's graph rewrite could actually benefit TF OD models as well, but the pattern is a lot more complicated for TF.
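trevor-m's layout point above can be illustrated with NumPy: in a hypothetical [batch, num_classes, num_anchors, 5] tensor, the anchors of one class form a single contiguous slice, while the packed [batch, num_anchors, 6] form scatters class ids across rows (sizes below are made up):

```python
import numpy as np

batch, num_classes, num_anchors = 2, 3, 4  # made-up sizes

# Packed class-aware layout: (batch, num_anchors, 6), each row holding
# [class_id, score, x1, y1, x2, y2]; anchors of one class are scattered.
packed = np.zeros((batch, num_anchors, 6), dtype=np.float32)

# Per-class layout: (batch, num_classes, num_anchors, 5); all anchors of
# a given class sit in one contiguous block of memory.
per_class = np.zeros((batch, num_classes, num_anchors, 5), dtype=np.float32)

class_slice = per_class[0, 1]  # every anchor for class 1 of batch 0
assert class_slice.flags["C_CONTIGUOUS"]

# Finding same-class anchors in the packed form means scanning the
# class-id column across every row instead of taking one slice.
same_class = packed[0][packed[0, :, 0] == 1.0]
```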





[GitHub] [tvm] mbrookhart commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548220676



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in pt_scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       I think this is fine for an integration test. If you want to write a unit test for your rewrite, I would suggest running it on a minimal example. I don't think it's strictly necessary.
   
   What would you imagine a search-and-not-rewrite should return? You could test the pattern search part by adding a throw into your callback and checking that the raised error matches.
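The "throw in the callback" trick can be sketched generically; this toy matcher over nested tuples stands in for the real dataflow-pattern machinery (names are illustrative, not the TVM API):

```python
# Toy sketch of testing pattern *search* via a raising callback
# (illustrative only; not the tvm.relay.dataflow_pattern API).

class PatternFound(Exception):
    pass

def run_rewrite(expr, pattern, callback):
    """Walk a nested-tuple expression; invoke callback on each match."""
    if isinstance(expr, tuple):
        for sub in expr:
            run_rewrite(sub, pattern, callback)
        if expr[0] == pattern:
            callback(expr)
    return expr

def raising_callback(match):
    raise PatternFound(f"matched {match[0]}")

expr = ("relu", ("nms", "boxes", "scores"))
try:
    run_rewrite(expr, "nms", raising_callback)
    matched = False
except PatternFound:
    matched = True
assert matched  # the pattern search did find the nms node
```

Asserting on the raised error confirms the search fired without needing to inspect the rewritten graph.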







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548228717



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in pt_scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       > I think this is fine for an integration test. If you want to write a unit test for your rewrite, I would suggest running it on a minimal example. I don't think it's strictly necessary.
   
       Ok, that makes sense. I'm fine with what I have now; I've removed the TODO above.
   
   I imagine a search-and-not-rewrite would simply return a boolean, and what you described would work as a quick implementation.







[GitHub] [tvm] masahi edited a comment on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750589039


   > When we do the nested loop, we always have to check if the id of instance k matches the id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the instance of k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once it's in memory, actually doing the iou calculation is relatively cheap, so skipping it doesn't help that much.
   
   That's highly possible. Looking at this if condition in the triangle inner loop:
   https://github.com/apache/tvm/blob/9956b5b859a24c8f4dec1abf308723b1257ffc66/python/tvm/topi/cuda/nms.py#L535-L540
   
   Previously, `force_suppress` was always True, so this condition short-circuits and the accesses to `out[base_idx + offset_k + id_index]` and `out[base_idx + offset_j + id_index]` just below never happen. But now, to make NMS class-aware, I had to change `force_suppress` to False, so those accesses always happen. This may be cancelling out the speedup from the reduced IOU tests. Storing the class IDs in a separate 1D tensor may help. 
   
   That brings me to one of my pain points with our NMS API: I believe it needs to be reworked. The current way of packing class ids and scores together with the bbox coordinates is a design mistake we inherited from MXNet. To store class ids, I have to cast the ids to float32 and update and pass `id_index` appropriately. Since our NMS API also requires scores to be packed with the bboxes, I had to update `score_index` too, and all frontends except MXNet need to do this concatenation. The worst part is that in the NMS IR, the very first thing we do is extract the 1D score tensor from the packed data. So I see no good reason to pack the score tensor and bbox coordinates together.
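A NumPy sketch of the packing described above (assuming the row layout [class_id, score, x1, y1, x2, y2], i.e. `id_index=0` and `score_index=1`; sizes are made up):

```python
import numpy as np

batch, num_anchors = 1, 4
id_index, score_index = 0, 1  # assumed positions within the packed row

# Frontends must cast integer class ids to float32 and concatenate
# ids, scores, and box coordinates into one (batch, num_anchors, 6) tensor.
class_ids = np.array([[2, 0, 1, 2]], dtype=np.int64)
scores = np.random.rand(batch, num_anchors).astype(np.float32)
boxes = np.random.rand(batch, num_anchors, 4).astype(np.float32)

packed = np.concatenate(
    [class_ids[..., None].astype(np.float32), scores[..., None], boxes],
    axis=-1,
)
assert packed.shape == (batch, num_anchors, 6)

# ...and the first thing the NMS kernel then does is undo the packing:
unpacked_scores = packed[..., score_index]
unpacked_ids = packed[..., id_index].astype(np.int64)
assert np.array_equal(unpacked_ids, class_ids)
assert np.allclose(unpacked_scores, scores)
```

The round-trip through float32 is also why integer class ids must survive the cast, which only holds for ids small enough to be exactly representable.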
                               





[GitHub] [tvm] mbrookhart commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750376423


   This is a bit of a shot in the dark.
   
   I wonder if we're memory access limited, and so that's why you don't see a performance improvement.
   
   When we do the nested loop, we always have to check whether the id of instance k matches the id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the id of instance k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once it's in memory, actually doing the iou calculation is relatively cheap, so skipping it doesn't help that much.
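A quick NumPy illustration of this hypothesis: reading class ids out of the packed (batch, num_anchors, 6) tensor is a strided access, while a dedicated 1D id tensor is dense (sizes are made up):

```python
import numpy as np

num_anchors, features = 1024, 6
packed = np.zeros((1, num_anchors, features), dtype=np.float32)

# Checking the class id of anchor k reads element [k, id_index] of the
# packed rows: consecutive ids are 6 float32s (24 bytes) apart, so each
# check likely pulls a whole row's worth of features into cache.
ids_view = packed[0, :, 0]
assert ids_view.strides == (features * 4,)  # 24-byte stride

# A separate dense 1D class-id tensor packs 16 ids per 64-byte cache line.
ids_dense = np.ascontiguousarray(ids_view)
assert ids_dense.strides == (4,)
```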





[GitHub] [tvm] kevinthesun commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
kevinthesun commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r547594012



##########
File path: python/tvm/relay/frontend/pytorch_utils.py
##########
@@ -25,3 +35,98 @@ def is_version_greater_than(ver):
     return "".join(re.findall(r"(\d+\.)(\d+\.)(\d)", torch.__version__)[0]) > "".join(
         re.findall(r"(\d+\.)(\d+\.)(\d)", ver)[0]
     )
+
+
+def batched_nms_pattern(boxes, scores, idxs, iou_threshold):
+    """A pattern to detect batched_nms function in torchvision"""

Review comment:
       Can we have more comments about this pattern-matching process? I'm a bit confused about how the class id is restored.







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548183138



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       Yes, the problem is that the model is huge so manually creating the reference is not possible. Programmatically creating a reference requires the same pattern match & rewrite I want to test :)
   
   We have search-and-rewrite, but I think it would also be nice to have search-but-not-rewrite for use cases like this. Just like regex.
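The regex analogy can be made concrete: `re.search` reports whether a pattern occurs without touching the text, while `re.sub` both searches and rewrites. The request here is for a Relay-pattern equivalent of the former, so the test could assert that the NMS pattern exists without performing the rewrite. A toy sketch with purely illustrative strings:

```python
import re

text = "nms(boxes, scores) after scatter"
pattern = r"nms\(\w+, \w+\)"

# search-but-not-rewrite: confirm the pattern is present, text untouched
found = re.search(pattern, text) is not None

# search-and-rewrite: analogous to what the dataflow rewriter does today
rewritten = re.sub(pattern, "class_aware_nms(boxes, scores, idxs)", text)
```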







[GitHub] [tvm] masahi commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-749858279


   @zhiics Sure, updated the description. The regression is only 200 us on CPU, so it may just be measurement noise.
   
   I have no idea why I'm not getting a good speedup. IOU tests, including the memory accesses to boxes, should definitely be reduced. The only additional overhead I can think of is that the input to NMS is one column wider, due to storing the class ids.
   
   Performance is not great, but I believe having access to class ids should not be a bad idea...





[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548183138



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       Yes, the problem is that the model is huge so manually creating the reference is not possible. Programmatically creating a reference requires the same pattern match & rewrite I want to test :)
   
   I think search-but-not-rewrite/replace functionality would be nice to have for use cases like this. Just like regex.







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548183138



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       Yes, the problem is that the model is huge so manually creating the reference is not possible. Programmatically creating a reference requires the same pattern match & rewrite I want to test :)







[GitHub] [tvm] masahi commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750589039


   > When we do the nested loop, we always have to check if the id of instance k matches the id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the instance of k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once it's in memory, actually doing the iou calculation is relatively cheap, so skipping it doesn't help that much.
   
   That's highly possible. Looking at this if condition in the triangle inner loop:
   https://github.com/apache/tvm/blob/9956b5b859a24c8f4dec1abf308723b1257ffc66/python/tvm/topi/cuda/nms.py#L535-L540
   
   previously, `force_suppress` was always True, so this condition short-circuits and the accesses to `out[base_idx + offset_k + id_index]` and `out[base_idx + offset_j + id_index]` just below never happen. But now, to make NMS class-aware, I had to change `force_suppress` to False, so those accesses always happen. This may be cancelling out the speedup from the reduced IOU tests. Storing the class ids in a separate 1D tensor may help.
   
   That brings me to my pain point with our NMS API: I believe it needs to be reworked. The current way of packing class ids and scores together with the bbox coordinates is a design mistake we inherited from MXNet. To store class ids, I have to cast the ids to float32 and update and pass `id_index` appropriately. Since our NMS API also requires scores to be packed with the bboxes, all frontends except MXNet need to do this concatenation. The worst part is that the very first thing the NMS IR does is extract the 1D score tensor from the packed data, so I see no good reason to pack the score tensor with the bbox coordinates.
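The packing being complained about can be sketched in NumPy as follows. The shapes and the `id_index`/score positions here are illustrative assumptions, not the exact TVM convention:

```python
import numpy as np

# Hypothetical shapes: one batch, N anchors.
N = 4
rng = np.random.default_rng(0)
boxes = rng.random((1, N, 4)).astype("float32")                    # x1, y1, x2, y2
scores = rng.random((1, N, 1)).astype("float32")
class_ids = rng.integers(0, 3, (1, N, 1)).astype("float32")        # cast ids to float to pack

# What frontends must do today: pack everything into one (1, N, 6) tensor,
# with id_index=0 and the score at column 1 of each packed row.
packed = np.concatenate([class_ids, scores, boxes], axis=-1)

# The first thing the NMS kernel then does is undo the packing:
extracted_scores = packed[:, :, 1]
```

Keeping scores (and ids) as separate tensors would make both the concatenation in the frontends and this extraction step unnecessary.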
                               





[GitHub] [tvm] masahi commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-749860080


   I should mention that this rewrite is not run by default, so there is no perf risk.





[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548204574



##########
File path: python/tvm/relay/frontend/pytorch_utils.py
##########
@@ -25,3 +35,98 @@ def is_version_greater_than(ver):
     return "".join(re.findall(r"(\d+\.)(\d+\.)(\d)", torch.__version__)[0]) > "".join(
         re.findall(r"(\d+\.)(\d+\.)(\d)", ver)[0]
     )
+
+
+def batched_nms_pattern(boxes, scores, idxs, iou_threshold):
+    """A pattern to detect batched_nms function in torchvision"""

Review comment:
       I added some comments to explain the pattern. It should be clear now.







[GitHub] [tvm] mbrookhart commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548046623



##########
File path: python/tvm/relay/frontend/pytorch_utils.py
##########
@@ -25,3 +35,98 @@ def is_version_greater_than(ver):
     return "".join(re.findall(r"(\d+\.)(\d+\.)(\d)", torch.__version__)[0]) > "".join(
         re.findall(r"(\d+\.)(\d+\.)(\d)", ver)[0]
     )
+
+
+def batched_nms_pattern(boxes, scores, idxs, iou_threshold):
+    """A pattern to detect batched_nms function in torchvision"""

Review comment:
       I think I understand what the rewritten pattern does, but I don't understand how the PyTorch version works.
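For context, torchvision's `batched_nms` turns a single class-agnostic NMS into a class-aware one by shifting each class's boxes into a disjoint coordinate range, so boxes of different classes can never overlap. A rough NumPy sketch of that offset trick (a simplification, not the exact torchvision code):

```python
import numpy as np

def batched_nms_offset_trick(boxes, idxs):
    # boxes: (N, 4) float rows of [x1, y1, x2, y2]; idxs: (N,) int class ids.
    # Shift each class's boxes by class_id * (max_coordinate + 1). A plain
    # class-agnostic NMS on the shifted boxes then behaves like per-class NMS,
    # because shifted boxes of different classes have zero overlap.
    max_coordinate = boxes.max()
    offsets = idxs.astype(boxes.dtype) * (max_coordinate + 1)
    return boxes + offsets[:, None]
```

The graph rewrite in this PR detects exactly this add-an-offset structure and replaces it with a direct class-aware NMS.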







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548183138



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       Yes, the problem is that the model is huge so manually creating the reference is not possible. Programmatically creating a reference requires the same pattern match & rewrite I want to test :)
   
   I think search-but-not-rewrite functionality would be nice to have. Just like regex.







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r548183138



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       Yes, the problem is that the model is huge, so manually creating the reference is not possible. Programmatically creating a reference requires the same pattern match & rewrite I want to test :)







[GitHub] [tvm] masahi edited a comment on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-749858279


   @zhiics Sure, updated the description. Unfortunately I cannot claim that this is a perf improvement. The regression is only 200 us on CPU, though, so it may just be measurement noise.
   
   I have no idea why I'm not getting a good speedup. IOU tests, including the memory accesses to boxes, should definitely be reduced. The only additional overhead I can think of is that the input to NMS is one column wider, due to storing the class ids.
   
   Performance is not great, but I believe having access to class ids should not be a bad idea...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r547566314



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       @mbrookhart Any suggestion here?







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r547566314



##########
File path: tests/python/frontend/pytorch/test_object_detection.py
##########
@@ -102,38 +105,55 @@ def test_detection_models():
     scripted_model = generate_jit_model(1)
     mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
 
-    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
-        vm_exec = relay.vm.compile(mod, target=target, params=params)
+    def compile_and_run_vm(mod, params, data_np):
+        with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+            vm_exec = relay.vm.compile(mod, target=target, params=params)
 
-    ctx = tvm.cpu()
-    vm = VirtualMachine(vm_exec, ctx)
-    data = process_image(img)
-    pt_res = scripted_model(data)
-    data = data.detach().numpy()
-    vm.set_input("main", **{input_name: data})
-    tvm_res = vm.run()
+        ctx = tvm.context(target, 0)
+        vm = VirtualMachine(vm_exec, ctx)
+        vm.set_input("main", **{input_name: data_np})
+        return vm.run()
 
+    data = process_image(img)
+    data_np = data.detach().numpy()
+    tvm_res = compile_and_run_vm(mod, params, data_np)
     # Note: due to accumulated numerical error, we can't directly compare results
     # with pytorch output. Some boxes might have a quite tiny difference in score
     # and the order can become different. We just measure how many valid boxes
     # there are for input image.
+    pt_res = scripted_model(data)
     pt_scores = pt_res[1].detach().numpy().tolist()
     tvm_scores = tvm_res[1].asnumpy().tolist()
-    num_pt_valid_scores = num_tvm_valid_scores = 0
 
-    for score in pt_scores:
-        if score >= score_threshold:
-            num_pt_valid_scores += 1
-        else:
-            break
+    def count_valid_scores(scores):
+        num_valid_scores = 0
+        for score in scores:
+            if score >= score_threshold:
+                num_valid_scores += 1
+            else:
+                break
+        return num_valid_scores
 
-    for score in tvm_scores:
-        if score >= score_threshold:
-            num_tvm_valid_scores += 1
-        else:
-            break
+    num_pt_valid_scores = count_valid_scores(pt_scores)
+    num_tvm_valid_scores = count_valid_scores(tvm_scores)
 
     assert num_pt_valid_scores == num_tvm_valid_scores, (
         "Output mismatch: Under score threshold {}, Pytorch has {} valid "
         "boxes while TVM has {}.".format(score_threshold, num_pt_valid_scores, num_tvm_valid_scores)
     )
+
+    before = mod["main"]
+    after = rewrite(NMSRewrite(), before)
+    # TODO(masahi): Is there a better way to test if the desired rewrite has happened?

Review comment:
       @mbrookhart Any suggestion here? Specifically, I am looking for a search-and-match function.







[GitHub] [tvm] masahi commented on a change in pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#discussion_r547615181



##########
File path: python/tvm/relay/frontend/pytorch_utils.py
##########
@@ -25,3 +35,98 @@ def is_version_greater_than(ver):
     return "".join(re.findall(r"(\d+\.)(\d+\.)(\d)", torch.__version__)[0]) > "".join(
         re.findall(r"(\d+\.)(\d+\.)(\d)", ver)[0]
     )
+
+
+def batched_nms_pattern(boxes, scores, idxs, iou_threshold):
+    """A pattern to detect batched_nms function in torchvision"""

Review comment:
       Ok I'll update after the CI finishes. In the mean time, you can have a look at rewrite test cases in https://github.com/apache/tvm/blob/main/tests/python/relay/test_dataflow_pattern.py#L698-L736 to get an idea of how they work
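   For reference, the trick torchvision's `batched_nms` uses — and that this pattern has to recognize — is to offset every box by its class id times the coordinate range, so boxes of different classes can never overlap and a single class-agnostic NMS pass becomes class-aware. A minimal pure-Python sketch of that idea (the helper names here are illustrative, not the torchvision API):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Class-agnostic NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def batched_nms(boxes, scores, idxs, iou_threshold):
    """Class-aware NMS via the coordinate-offset trick: shift each box by
    class_id * (max_coordinate + 1) so boxes of different classes can never
    overlap, then run one class-agnostic NMS over the shifted boxes."""
    max_coord = max(c for box in boxes for c in box)
    offset = [[c + cls * (max_coord + 1.0) for c in box]
              for box, cls in zip(boxes, idxs)]
    return nms(offset, scores, iou_threshold)
```

   With this, two heavily overlapping boxes both survive when their class ids differ, while same-class overlaps are still suppressed — which is exactly the structure the dataflow pattern above needs to match in the imported graph.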







[GitHub] [tvm] zhiics commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
zhiics commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-749852977


   @masahi Thanks for the perf improvement. Could you provide the CPU numbers as well?





[GitHub] [tvm] masahi edited a comment on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750589039


   > When we do the nested loop, we always have to check if the id of instance k matches the id of instance j. Since the input shape is (batch_size, num_anchors, features), and features = 6 here, I wouldn't be surprised if checking the instance of k ends up reading all of the features of k into registers, and that memory read is the expensive operation. Once it's in memory, actually doing the iou calculation is relatively cheap, so skipping it doesn't help that much.
   
   That's highly possible. Looking at this if condition in the triangle inner loop:
   https://github.com/apache/tvm/blob/9956b5b859a24c8f4dec1abf308723b1257ffc66/python/tvm/topi/cuda/nms.py#L535-L540
   
   previously, `force_suppress` was always True, so this condition short-circuits and the accesses to `out[base_idx + offset_k + id_index]` and `out[base_idx + offset_j + id_index]` just below never happen. But now, to make NMS class-aware, I had to change `force_suppress` to False, so those accesses always happen. This may be cancelling the speedup from the reduced IOU tests. Storing the class IDs in a separate 1D tensor may help. 
   
   That brings me to my pain point with our NMS API: I believe it needs to be reworked. The current way of packing class ids and scores together with bbox coordinates is a design mistake that we inherited from MXNet. To store class ids, I have to cast the ids to float32 and update and pass `id_index` appropriately. Since our NMS API also requires scores to be packed with the bboxes, I had to update `score_index` too, and every frontend except MXNet needs to do this concatenation. The worst part is that in the NMS IR, the very first thing we do is extract the 1D score tensor back out of the packed data, so I see no good reason to pack the score tensor with the bbox coordinates.
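   To make the layout issue concrete, here is a pure-Python sketch of NMS over the MXNet-style packed tensor, where each row is `[class_id, score, x1, y1, x2, y2]` (so `id_index = 0`, `score_index = 1`). With `force_suppress=False`, every candidate pair reads both class ids out of the packed rows — the extra memory traffic described above. The function names are illustrative, not the actual topi IR:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def packed_nms(data, iou_threshold, id_index=0, score_index=1,
               force_suppress=False):
    """NMS over packed rows [class_id, score, x1, y1, x2, y2].
    When force_suppress is False, the reads of data[j][id_index] and
    data[k][id_index] happen for every candidate pair, mirroring the
    accesses in the triangle loop discussed above."""
    order = sorted(range(len(data)), key=lambda i: data[i][score_index],
                   reverse=True)
    keep = []
    for k in order:
        suppressed = False
        for j in keep:
            same_class = (force_suppress
                          or data[j][id_index] == data[k][id_index])
            if same_class and iou(data[j][2:], data[k][2:]) > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            keep.append(k)
    return keep
```

   If the class ids (and scores) lived in their own 1D tensors, the inner-loop id check would touch two scalars instead of two 6-wide rows, which is the rework being argued for here.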
                               





[GitHub] [tvm] mbrookhart commented on pull request #7154: [Torch] Restore class-aware NMS for detection models by graph rewrite

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on pull request #7154:
URL: https://github.com/apache/tvm/pull/7154#issuecomment-750926391


   Sorry for the delay in responding to this, I wanted to look at the frameworks more closely. We currently have 5 importers that leverage NMS:
   
   MXNet does multibox_transform_loc and then NMS on the outputs. multibox_transform_loc converts a 3D array of scores with shape (batch_size, class_num, num_anchors) into the most likely class and the score for that class, and also does some coordinate transforms on the boxes.
   
   ONNX takes a 3D tensor of (batch_size, class, num_anchors), does slicing/concatenating with the boxes, and then does a per-class get_valid_counts->non_max
   
   Pytorch takes in a 1D tensor of scores and concats it with the boxes before performing get_valid_counts and nms. As @masahi shows in this PR, there is preprocessing to embed all classes into that 1D tensor.
   
   TF takes a 1D tensor of scores and concats it to the boxes before performing get_valid_counts and nms. I'm not sure if the rest of the TF graph is handling the loop over batch size and classes.
   
   TFLite takes a 3D score tensor of shape (batch_size, num_anchors, class_id), reorders it to (batch_size, class_id, num_anchors), performs multibox_transform_loc->nms, and, strangely, does get_valid_counts after NMS. 
   
   It looks like we're doing pre-processing in every framework to reduce the amount of score information and convert it to the 5- or 6-feature packed form the NMS API wants, and then in at least TFLite and ONNX, we perform further splitting/slicing/concatenating to restore the separate class ids. 
   
   I think I agree with @masahi, we seem to be jumping through a lot of hoops in the importers to support a TVM NMS API that's out of line with the frameworks, and that might be hurting our overall performance.
   
   
   
   
   

