Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/11/03 22:38:04 UTC

[GitHub] [incubator-tvm] alexgl-github opened a new pull request #6840: conv1d_transpose speedup

alexgl-github opened a new pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840


   Thanks for contributing to TVM! Please refer to the guidelines at https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from [Reviewers](https://github.com/apache/incubator-tvm/blob/master/CONTRIBUTORS.md#reviewers) by @-mentioning them in the pull request thread.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] alexgl-github edited a comment on pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
alexgl-github edited a comment on pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#issuecomment-721412452


   Speed up transposed 1-D convolution by eliminating the unnecessary multiplications by the zero values that dilation inserts into the data when the stride is greater than 1.

   Below are current vs. proposed latency numbers for various transposed conv1d parameters:

   ```
   latency current=0.06837 sec  new=0.00049 sec   channels_out=   1 input_shape=(1, 257, 128) kernel_size=512 strides=128 padding=256
   latency current=0.53093 sec  new=0.10341 sec   channels_out= 257 input_shape=(1, 257, 128) kernel_size=512 strides=4 padding=256
   latency current=0.00292 sec  new=0.00307 sec   channels_out=   1 input_shape=(1, 257, 128) kernel_size=512 strides=1 padding=256
   latency current=0.00474 sec  new=0.00171 sec   channels_out=   1 input_shape=(1, 257, 128) kernel_size=512 strides=2 padding=256
   latency current=0.00955 sec  new=0.00056 sec   channels_out=   1 input_shape=(1, 257, 128) kernel_size=512 strides=16 padding=256
   latency current=0.00054 sec  new=0.00023 sec   channels_out=   1 input_shape=(1, 1, 16384) kernel_size=512 strides=2 padding=256
   latency current=0.00385 sec  new=0.00089 sec   channels_out=   4 input_shape=(1, 1, 16384) kernel_size=512 strides=4 padding=256
   latency current=0.00013 sec  new=0.00004 sec   channels_out=   1 input_shape=(1, 1, 1024) kernel_size=512 strides=5 padding=256
   latency current=0.00002 sec  new=0.00002 sec   channels_out=  32 input_shape=(1, 3, 224) kernel_size=5 strides=1 padding=0
   latency current=0.00004 sec  new=0.00003 sec   channels_out=  32 input_shape=(1, 3, 224) kernel_size=5 strides=2 padding=0
   latency current=0.00006 sec  new=0.00003 sec   channels_out= 128 input_shape=(1, 32, 32) kernel_size=5 strides=2 padding=0
   ```
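   As a minimal standalone sketch of where the savings come from (NumPy, not the TVM implementation; all names are illustrative): the zero-dilation formulation multiplies by inserted zeros (stride-1)/stride of the time, while an equivalent scatter formulation touches only real input samples:

   ```python
   import numpy as np

   def conv1d_transpose_naive(x, w, stride):
       """Transposed conv via zero-dilation + ordinary convolution."""
       k = len(w)
       dilated = np.zeros((len(x) - 1) * stride + 1)
       dilated[::stride] = x                  # stride-1 zeros between samples
       padded = np.pad(dilated, k - 1)
       # For stride > 1, most products below are multiplications by the
       # zeros inserted into `dilated`.
       return np.array([np.dot(padded[i:i + k], w[::-1])
                        for i in range(len(padded) - k + 1)])

   def conv1d_transpose_scatter(x, w, stride):
       """Same result without ever touching the inserted zeros."""
       k = len(w)
       out = np.zeros((len(x) - 1) * stride + k)
       for i, v in enumerate(x):              # each input sample scatters
           out[i * stride:i * stride + k] += v * w   # exactly k products
       return out

   x, w = np.random.randn(128), np.random.randn(512)
   assert np.allclose(conv1d_transpose_naive(x, w, 4),
                      conv1d_transpose_scatter(x, w, 4))
   ```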
     





[GitHub] [incubator-tvm] vinx13 commented on pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
vinx13 commented on pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#issuecomment-723379937


   Thanks @alexgl-github @anijain2305 @giuseros 





[GitHub] [incubator-tvm] anijain2305 commented on pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
anijain2305 commented on pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#issuecomment-721438682


   @vinx13 Can you PTAL?





[GitHub] [incubator-tvm] alexgl-github commented on a change in pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
alexgl-github commented on a change in pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#discussion_r517582147



##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')

Review comment:
       I don't think the ceil of a floating-point division can be used in reduce_axis. There's no integer equivalent for ceil in tir.op, so I've replaced it with tvm.tir.indexdiv(...).
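   For reference, a quick standalone check (plain Python, outside TVM) of the integer identity the reduce_axis extent relies on, `(kernel_size + stride - 1) // stride == ceil(kernel_size / stride)`:

   ```python
   import math

   # (a + b - 1) // b is the standard integer form of ceil(a / b) for
   # positive integers; this is what the reduce_axis extent computes.
   for kernel_size in range(1, 64):
       for stride in range(1, 16):
           assert (kernel_size + stride - 1) // stride == math.ceil(kernel_size / stride)
   ```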







[GitHub] [incubator-tvm] alexgl-github commented on a change in pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
alexgl-github commented on a change in pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#discussion_r517581190



##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')
+    border = pad_left * (stride - 1)
+
     data_out = te.compute(
         (batch, out_channels, out_width),
-        lambda b, c, w: te.sum(
-            data[b, dc, w + dw].astype(out_dtype)
-            * kernel[dc, c, kernel_size - 1 - dw].astype(out_dtype),
-            axis=[dc, dw],
-        ),
-        tag="conv1d_transpose_ncw",
-    )
+        lambda b, co, w: te.sum(
+            padded_data[b, ci, (border+w + stride - 1) // stride + k].astype(out_dtype) *

Review comment:
       It works for padding=0, kernel_size=2, stride=2; I've added a test for this case.
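   A small standalone check (plain Python, not part of the PR) of the base data index for that case, using the pad_left and border definitions from the diff above:

   ```python
   kernel_size, stride, padding = 2, 2, 0
   pad_left = kernel_size - 1 - padding   # = 1 after the re-assignment in the diff
   border = pad_left * (stride - 1)       # = 1
   print([(border + w + stride - 1) // stride for w in range(8)])
   # -> [1, 1, 2, 2, 3, 3, 4, 4]: consecutive outputs share a base index
   #    into padded_data, and the indexmod(stride - w - border, stride)
   #    phase term selects the matching taps from padded_kernel.
   ```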







[GitHub] [incubator-tvm] alexgl-github commented on a change in pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
alexgl-github commented on a change in pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#discussion_r517580833



##########
File path: tests/python/topi/python/test_topi_conv1d_transpose_ncw.py
##########
@@ -94,6 +94,9 @@ def test_conv1d_transpose_ncw():
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (0, 3), (0,))
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (1, 3), (0,))
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (2, 3), (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 1, 512, 128, 256, (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 2, 512, 128, 256, (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 257, 512, 128, 256, (0,))

Review comment:
       Added more tests with kernel_size == stride.
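   For instance, a hypothetical case following the signature used above (the exact parameters added in the PR may differ):

   ```python
   # Hypothetical example; the actual test cases added may differ.
   verify_conv1d_transpose_ncw(1, 257, 128, 1, 512, 512, 256, (0,))  # kernel_size == stride == 512
   ```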







[GitHub] [incubator-tvm] vinx13 merged pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
vinx13 merged pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840


   



[GitHub] [incubator-tvm] giuseros commented on a change in pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
giuseros commented on a change in pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#discussion_r517271380



##########
File path: tests/python/topi/python/test_topi_conv1d_transpose_ncw.py
##########
@@ -94,6 +94,9 @@ def test_conv1d_transpose_ncw():
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (0, 3), (0,))
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (1, 3), (0,))
     verify_conv1d_transpose_ncw(1, 1, 10, 1, 5, 1, (2, 3), (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 1, 512, 128, 256, (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 2, 512, 128, 256, (0,))
+    verify_conv1d_transpose_ncw(1, 257, 128, 257, 512, 128, 256, (0,))

Review comment:
       Could you add a test for the case kernel_size == stride? I think this is common in encoder/decoder networks.

##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')
+    border = pad_left * (stride - 1)
+
     data_out = te.compute(
         (batch, out_channels, out_width),
-        lambda b, c, w: te.sum(
-            data[b, dc, w + dw].astype(out_dtype)
-            * kernel[dc, c, kernel_size - 1 - dw].astype(out_dtype),
-            axis=[dc, dw],
-        ),
-        tag="conv1d_transpose_ncw",
-    )
+        lambda b, co, w: te.sum(
+            padded_data[b, ci, (border+w + stride - 1) // stride + k].astype(out_dtype) *
+            padded_kernel[ci, co, k*stride + tvm.tir.indexmod(stride-w-border, stride)].astype(out_dtype),
+            axis=[ci, k]), tag="conv1d_transpose_ncw")

Review comment:
       Could you add a few more comments about the algorithm you are using?

##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')
+    border = pad_left * (stride - 1)
+
     data_out = te.compute(
         (batch, out_channels, out_width),
-        lambda b, c, w: te.sum(
-            data[b, dc, w + dw].astype(out_dtype)
-            * kernel[dc, c, kernel_size - 1 - dw].astype(out_dtype),
-            axis=[dc, dw],
-        ),
-        tag="conv1d_transpose_ncw",
-    )
+        lambda b, co, w: te.sum(
+            padded_data[b, ci, (border+w + stride - 1) // stride + k].astype(out_dtype) *

Review comment:
       Are you sure this works for padding=0, kernel_size=2, stride=2?
   In this case `ceil(w / stride)` gives 0, 1, 1, 2, 2, ..., while *I think* (but I might be wrong) that you want 0, 0, 1, 1, 2, 2, etc. when you index the `padded_data` tensor.

##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')

Review comment:
       Instead of writing `(kernel_size + stride - 1)//stride`, could you write `ceil(kernel_size/stride)`? In the end, it would make the code more understandable, and it wouldn't hurt performance.







[GitHub] [incubator-tvm] anijain2305 edited a comment on pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
anijain2305 edited a comment on pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#issuecomment-721438682


   @vinx13 Can you PTAL at the CUDA changes?





[GitHub] [incubator-tvm] alexgl-github commented on a change in pull request #6840: conv1d_transpose speedup

Posted by GitBox <gi...@apache.org>.
alexgl-github commented on a change in pull request #6840:
URL: https://github.com/apache/incubator-tvm/pull/6840#discussion_r517582553



##########
File path: python/tvm/topi/cuda/conv1d_transpose_ncw.py
##########
@@ -65,36 +66,43 @@ def conv1d_transpose_ncw(cfg, data, kernel, stride, padding, out_dtype, output_p
     out_width = (inp_width - 1) * stride + kernel_size - pad_left - pad_right + output_padding
     pad_left = kernel_size - 1 - pad_left
     pad_right = kernel_size - 1 - pad_right + output_padding
+    padded_width = pad_left + inp_width + pad_right
     dilated_width = stride * (inp_width - 1) + 1
-    data = te.compute(
-        (batch, inp_channels, pad_left + dilated_width + pad_right),
+    padded_dilated_width = pad_left + dilated_width + pad_right
+
+    padded_data = te.compute(
+        (batch, inp_channels, padded_width),
         lambda n, c, x: tvm.tir.if_then_else(
-            tvm.tir.all(
-                x >= pad_left,
-                x < pad_left + dilated_width,
-                tvm.tir.indexmod(x - pad_left, stride).equal(0),
-            ),
-            data[n, c, tvm.tir.indexdiv(x - pad_left, stride)],
-            tvm.tir.const(0.0, "float32"),
-        ),
-        name="data_pad",
-    )
-
-    dc = te.reduce_axis((0, inp_channels), name="dc")
-    dw = te.reduce_axis((0, kernel_size), name="dw")
+            tvm.tir.all(x >= pad_left,
+                        x < pad_left + inp_width),
+            data[n, c, x - pad_left],
+            tvm.tir.const(0., "float32")),
+        name='data_pad')
+
+    padded_kernel = te.compute(
+        (inp_channels, out_channels, kernel_size + stride - 1),
+        lambda ci, co, k: tvm.tir.if_then_else(
+            tvm.tir.all(k < kernel_size),
+            kernel[ci, co, kernel_size-k-1],
+            tvm.tir.const(0., "float32")),
+        name='kernel_pad')
+
+    ci = te.reduce_axis((0, inp_channels), name='ci')
+    k = te.reduce_axis((0, (kernel_size + stride - 1)//stride), name='k')
+    border = pad_left * (stride - 1)
+
     data_out = te.compute(
         (batch, out_channels, out_width),
-        lambda b, c, w: te.sum(
-            data[b, dc, w + dw].astype(out_dtype)
-            * kernel[dc, c, kernel_size - 1 - dw].astype(out_dtype),
-            axis=[dc, dw],
-        ),
-        tag="conv1d_transpose_ncw",
-    )
+        lambda b, co, w: te.sum(
+            padded_data[b, ci, (border+w + stride - 1) // stride + k].astype(out_dtype) *
+            padded_kernel[ci, co, k*stride + tvm.tir.indexmod(stride-w-border, stride)].astype(out_dtype),
+            axis=[ci, k]), tag="conv1d_transpose_ncw")

Review comment:
       Added notes about the index calculation.
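   As a tiny standalone illustration (plain Python, not the TVM code) of the grouping behind those notes: for a fixed output phase w % stride, only every stride-th kernel tap can line up with a real (non-inserted-zero) input sample, so the reduction shrinks from kernel_size taps to ceil(kernel_size / stride):

   ```python
   # Illustrative only: which kernel taps can contribute for each
   # output phase, with kernel_size=5, stride=2.
   kernel_size, stride = 5, 2
   for phase in range(stride):
       taps = list(range(phase, kernel_size, stride))
       print(f"phase {phase}: taps {taps}")
   # phase 0: taps [0, 2, 4]
   # phase 1: taps [1, 3]
   ```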



