Posted to discuss-archive@tvm.apache.org by Iz Beltagy via TVM Discuss <no...@discuss.tvm.ai> on 2020/04/14 20:24:08 UTC

[TVM Discuss] [Questions] Developing a faster schedule for Longformer's kernel


We recently released a transformer model for long documents that is powered by a custom CUDA kernel implemented in TVM ([here's](https://twitter.com/ApacheTVM/status/1249883784410873856) the TVM account's tweet about it).

Would anyone be interested in implementing a faster schedule for the kernel? I think it would be a great showcase for the usability and efficiency of TVM, and it could have a big impact on the NLP community.

In case anyone is interested, here is some background: 

- The kernel is a form of banded matrix multiplication in which we only compute certain diagonals of the output matrix (see figures 2.b and 2.c in the [paper](https://arxiv.org/pdf/2004.05150.pdf); a sketch of the computation follows this list).

- Our schedule [here](https://github.com/allenai/longformer/blob/master/longformer/diagonaled_mm_tvm.py#L85) is 16x slower than it should be.

- The `batched_matmul` schedule [here](https://github.com/facebookexperimental/tvm/blob/master/topi/python/topi/cuda/batch_matmul.py#L71) is 2x faster than ours for the setting in figure 2.b (I will use it instead of our schedule for this case), but it is much worse than our schedule for the setting in figure 2.c.
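
For concreteness, here is a minimal sketch of the computation. This is not the actual kernel in `diagonaled_mm_tvm.py`: the `(b, n, h, m)` tensor layout, the names `X`/`Y`/`Z`, and the guarded loads at the band edges are assumptions, and it only covers the symmetric-window case (`w_upper = w`). It is written against the old-style flat `tvm.*` API that the rest of this thread uses.

```
import tvm

# Example sizes (see the constants posted later in the thread).
b, n, h, m, w = 1, 4096, 12, 768, 256

X = tvm.placeholder((b, n, h, m), name='X')  # e.g. queries
Y = tvm.placeholder((b, n, h, m), name='Y')  # e.g. keys
k = tvm.reduce_axis((0, m), name='k')

# Z[bb, i, hh, d] is the dot product of row i of X with row (i + d - w) of Y,
# i.e. only the 2w+1 diagonals of the full n x n score matrix are computed.
Z = tvm.compute(
    (b, n, h, 2 * w + 1),
    lambda bb, i, hh, d: tvm.sum(
        tvm.if_then_else(
            tvm.all(i + d - w >= 0, i + d - w < n),  # stay inside the sequence
            X[bb, i, hh, k] * Y[bb, i + d - w, hh, k],
            tvm.const(0.0, X.dtype)),
        axis=k),
    name='Z')

# The open question is how to schedule this well for CUDA
# (tiling, shared-memory caching, thread binding, ...).
s = tvm.create_schedule(Z.op)
```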

So the question is whether we can implement a schedule that is faster than both ours and `batched_matmul`. If anyone is interested in working on this, please let me know.

Thanks





---

Posted by Iz Beltagy via TVM Discuss <no...@discuss.tvm.ai>.

Sorry for the complicated code. I didn't know how to compile multiple kernels into one `.so` file, so I ended up cramming three functions (one for the forward pass and two for the backward pass) into a single kernel, with flags to switch between them (a sketch of one possible way around this follows the constants below).
Here are the constants for the forward pass:
```
b = 1  # batch size
n = 4096  # sequence length
h = 12  # number of heads (this dimension can be merged with the batch size if needed)
m = 768  # hidden dimension -> 768
w = 256  # window size on one side
w_upper = 256  # window size to the right of the word. Should be `w` for the non-autoregressive case
padding = 0  # padding -> any constant
transpose_t1 = 0  # `0` for one of the backward functions and `1` for the other; doesn't matter for the forward pass
t1d3 = 768  # last dimension of t1 -> this is `m` for the forward pass and `2w+1` (number of diagonals) for the backward pass
t3d3 = 513  # last dimension of t3 (the result tensor) -> this is `2w+1` for the forward pass
```
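
On the "multiple kernels in one `.so`" point above: if I remember the 0.6-era API (the one used in this thread) correctly, `tvm.lower` can produce one `LoweredFunc` per schedule and `tvm.build` accepts a list of them, so separate forward/backward kernels could live side by side in one exported library instead of being merged behind flags. A rough sketch, with two toy kernels standing in for the real ones:

```
import tvm
import numpy as np

# Two toy kernels standing in for the forward/backward functions, just to
# show how several functions can be packaged into one .so.
n = tvm.var('n')
A = tvm.placeholder((n,), name='A')
B = tvm.compute((n,), lambda i: A[i] + 1.0, name='B')
C = tvm.compute((n,), lambda i: A[i] * 2.0, name='C')

funcs = [
    tvm.lower(tvm.create_schedule(B.op), [A, B], name='add_one'),
    tvm.lower(tvm.create_schedule(C.op), [A, C], name='times_two'),
]
mod = tvm.build(funcs, target='llvm')   # 'cuda' would also need thread binding
mod.export_library('two_kernels.so')    # one .so, two entry points

loaded = tvm.module.load('two_kernels.so')
x = tvm.nd.array(np.arange(4, dtype='float32'))
y = tvm.nd.empty((4,), dtype='float32')
loaded['add_one'](x, y)
print(y.asnumpy())   # [1. 2. 3. 4.]
```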





---

Posted by Bing Xu via TVM Discuss <no...@discuss.tvm.ai>.

  Could you provide these parameters?

    b = tvm.var('b')  # batch size -> 12 or n
    n = tvm.var('n')  # sequence length -> 512
    h = tvm.var('h')  # number of heads -> ?
    m = tvm.var('m')  # hidden dimension -> 768
    w = tvm.var('w')  # window size -> 512
    w_upper = tvm.var('w_upper')  # window size to the right of the word. Should be `0` or `w` -> ?
    padding = tvm.var('padding')  # padding -> ?
    transpose_t1 = tvm.var('transpose_t1')  # t1 should be transposed -> True / False
    t1d3 = tvm.var('t1d3')  # last dimension of t1 -> ?
    t3d3 = tvm.var('t3d3')  # last dimension of t3 (the result tensor) -> ?





---

Posted by Iz Beltagy via TVM Discuss <no...@discuss.tvm.ai>.

If you want one or two specific configurations to work with, they would be:
- batch size = 12 (but `batch_matmul_schedule` didn't require a constant batch size, so maybe this doesn't need to be constant)
- embedding size: 768
- sequence length: 4,096
- window size: 512
- dilation: 0 and 3 (I think a lot of the locality assumptions that caching relies on will break once we start working with non-zero dilation; that's why we need to study both cases: 0 because it is the most common, and 3 because it is representative of the cases where locality breaks; see the sketch below)
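
To make the locality argument concrete, here is a small illustration. The indexing scheme (a gap of `dilation` between attended positions, with dilation 0 meaning a dense window) is my reading of the post, not the exact formula used in `diagonaled_mm_tvm.py`.

```
# Key rows that query row `i` reads across its 2w+1 diagonals.
def attended_rows(i, w, dilation):
    step = max(dilation, 1)  # treat dilation 0 as a dense, contiguous window
    return [i + (d - w) * step for d in range(2 * w + 1)]

# With dilation 0, neighbouring queries read almost the same contiguous block
# of key rows, which is what tiling / shared-memory caching exploits.
# With dilation 3, the rows are strided and that reuse pattern changes.
print(attended_rows(100, w=4, dilation=0))  # [96, 97, ..., 104]  contiguous
print(attended_rows(100, w=4, dilation=3))  # [88, 91, ..., 112]  strided
```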




