Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/06/01 23:09:54 UTC

[GitHub] [tvm] altanh opened a new pull request, #11531: [TOPI] TE implementation of LSTM using scan

altanh opened a new pull request, #11531:
URL: https://github.com/apache/tvm/pull/11531

   This PR adds a TE implementation of LSTM (with optional modifications, similar to those in https://github.com/apache/tvm/blob/main/python/tvm/relay/frontend/common.py#L774), using the `te.scan` construct, so that the recurrent loop is a truly sequential loop rather than being statically unrolled. The compute definition should also support a symbolic sequence length.
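
   For readers unfamiliar with `te.scan`, here is a rough sketch of the pattern (heavily simplified: single layer, fused i/f/o/g gates, no projection or other optional modifications, fixed illustrative sizes), loosely following apps/topi_recipe/rnn/lstm.py rather than the exact compute in this PR:

   ```python
   import tvm
   from tvm import te

   seq_len, batch, hidden = 128, 1, 256  # illustrative; seq_len could also be a te.var

   # Input-to-hidden dense precomputed for all timesteps, plus hidden-to-hidden weights
   Xi2h = te.placeholder((seq_len, batch, 4, hidden), name="Xi2h")
   Wh2h = te.placeholder((4, hidden, hidden), name="Wh2h")

   # State placeholders (time axis first) and zero initial states
   s_h = te.placeholder((seq_len, batch, hidden), name="s_h")
   s_c = te.placeholder((seq_len, batch, hidden), name="s_c")
   init_h = te.compute((1, batch, hidden), lambda _, i, j: 0.0, name="init_h")
   init_c = te.compute((1, batch, hidden), lambda _, i, j: 0.0, name="init_c")

   # Recurrent hidden-to-hidden dense reads the previous timestep's hidden state
   k = te.reduce_axis((0, hidden), name="k")
   h2h = te.compute(
       (seq_len, batch, 4, hidden),
       lambda t, i, g, j: te.sum(s_h[t - 1, i, k] * Wh2h[g, j, k], axis=k),
       name="h2h",
   )
   gates = te.compute(Xi2h.shape, lambda t, i, g, j: Xi2h[t, i, g, j] + h2h[t, i, g, j], name="gates")

   gshape = (seq_len, batch, hidden)
   in_g = te.compute(gshape, lambda t, i, j: te.sigmoid(gates[t, i, 0, j]), name="in_gate")
   forget_g = te.compute(gshape, lambda t, i, j: te.sigmoid(gates[t, i, 1, j]), name="forget_gate")
   out_g = te.compute(gshape, lambda t, i, j: te.sigmoid(gates[t, i, 2, j]), name="out_gate")
   cell_in = te.compute(gshape, lambda t, i, j: te.tanh(gates[t, i, 3, j]), name="cell_in")

   next_c = te.compute(
       gshape,
       lambda t, i, j: forget_g[t, i, j] * s_c[t - 1, i, j] + in_g[t, i, j] * cell_in[t, i, j],
       name="next_c",
   )
   next_h = te.compute(gshape, lambda t, i, j: out_g[t, i, j] * te.tanh(next_c[t, i, j]), name="next_h")

   # te.scan ties the recurrence together: the time loop stays a sequential loop
   scan_h, scan_c = te.scan(
       [init_h, init_c], [next_h, next_c], [s_h, s_c], inputs=[Xi2h], name="lstm_scan"
   )
   ```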
   
   Missing from this PR:
   - optimized schedule for any target
   - corresponding higher-level Relay op
   - attempts to use metaschedule (more on this later)
   
   I'll send a follow-up PR for the Relay op, but scheduling the LSTM might take a while (if anyone is interested, please feel free to take a stab!). The main things to optimize are the dense operations within the kernel (the initial input-hidden dense, the recurrent hidden-hidden dense, and the hidden-projection dense). I couldn't figure out a great way to reuse the existing dense schedules here...
   
   Things I am hoping to try:
   - Fix a particular variant of LSTM and write an S-TIR kernel for it, then try to schedule the individual blocks (reusing existing schedules where possible). Because LSTM has so many optional components, I'm not sure how easy it would be to do TVMScript-level metaprogramming to inject the optional computations.
   - Once the Relay op is up, add a cuDNN strategy as an option for NVIDIA gpus
   
   Regarding metascheduling: the current `CreatePrimFunc` conversion from TE to S-TIR doesn't support scan operations. I have a hack that makes the conversion work, but I'm hitting some snags with schedule rules, primitives, and postprocs (the outer scan axis seems to break a lot of assumptions). I can try to clean up this conversion if that's valuable, but I'm also curious whether anyone is interested in tackling this by relaxing the constraints on blocks to support an outer scan axis.
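
   For context, the conversion in question looks roughly like this (a sketch using the hypothetical tensor names from the TE sketch above, not the exact names in the PR):

   ```python
   from tvm import te

   # Convert the TE compute (including the scan) into an S-TIR PrimFunc.
   # As noted above, this currently fails for scan operations because
   # CreatePrimFunc has no ScanOp handling; the "hack" is a local patch to it.
   prim_func = te.create_prim_func([Xi2h, Wh2h, scan_h, scan_c])
   print(prim_func)  # would print the TVMScript for the converted function
   ```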
   
   cc @vinx13 @junrushao1994 @tkonolige @michalpiszczek @masahi 
   
   Additional thanks to @vinx13 and @zxybazh for helping debug metaschedule issues (I hope this PR serves as a concrete starting point for getting things working); maybe you can cc others who may be interested? And thanks to @junrushao1994 for the very helpful LSTM example from ~5 (!) years ago, https://github.com/apache/tvm/blob/main/apps/topi_recipe/rnn/lstm.py, which I used as a starting point.
   




[GitHub] [tvm] vinx13 merged pull request #11531: [TOPI] TE implementation of LSTM using scan

Posted by GitBox <gi...@apache.org>.
vinx13 merged PR #11531:
URL: https://github.com/apache/tvm/pull/11531




[GitHub] [tvm] altanh commented on pull request #11531: [TOPI] TE implementation of LSTM using scan

Posted by GitBox <gi...@apache.org>.
altanh commented on PR #11531:
URL: https://github.com/apache/tvm/pull/11531#issuecomment-1146362726

   > This is a great first set of steps towards improving LSTM performance. Could you comment on what this unscheduled performance looks like vs what we currently have in TVM?
   
   It's pretty terrible with the naive dense loops, even compared to untuned TVM with default schedules. For example (on a 5900X):
   ```
      seq_len = 80
   batch_size = 1
       in_dim = 512
   hidden_dim = 256
   
   compiling TE LSTM...
     took 0.04480266571044922 seconds.
   TOPI mean (ms): 48.967991919999996
   compiling Relay unrolled LSTM...
   One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
     took 42.43188190460205 seconds.
   Relay mean (ms): 14.4790252
   ```
   At least it compiles quickly, haha. The Relay baseline uses the `lstm_cell` from `relay/frontend/common.py`. Note that for this benchmark I did some very basic scheduling by inlining the gate and activation computations, roughly along the lines of the sketch below.
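
   A sketch of what that basic scheduling could look like, reusing the hypothetical tensor names from the TE sketch in the PR description (not the exact names in the code):

   ```python
   import tvm
   from tvm import te

   # Inline the elementwise gate/activation computes into their consumers,
   # then build; the h2h reduction is left as-is.
   s = te.create_schedule(scan_h.op)
   for t in (gates, in_g, forget_g, out_g, cell_in):
       s[t].compute_inline()
   mod = tvm.build(s, [Xi2h, Wh2h, scan_h, scan_c], target="llvm")
   ```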
   
   Reducing the input and hidden dimensions shows some gains, I think from reduced kernel overhead (I increased the sequence length to exaggerate the effect):
   ```
      seq_len = 256
   batch_size = 1
       in_dim = 16
   hidden_dim = 16
   
   compiling TE LSTM...
     took 0.057991743087768555 seconds.
   TOPI mean (ms): 0.14541639
   compiling Relay unrolled LSTM...
   One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
     took 708.0528562068939 seconds.
   Relay mean (ms): 2.62690786
   ```
   (the compile time is pretty ridiculous on Relay)
   
   Here's the script I used for benchmarking: https://gist.github.com/altanh/a6dc8bf633028eaca5fbedbb591064f2
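
   For reference, a minimal measurement using TVM's `time_evaluator` might look like the sketch below (assuming `mod` was built from the schedule sketch above with matching shapes; the actual gist may differ):

   ```python
   import numpy as np
   import tvm

   dev = tvm.cpu(0)
   args = [
       tvm.nd.array(np.random.rand(128, 1, 4, 256).astype("float32"), dev),  # Xi2h
       tvm.nd.array(np.random.rand(4, 256, 256).astype("float32"), dev),     # Wh2h
       tvm.nd.array(np.zeros((128, 1, 256), "float32"), dev),                # scan_h output
       tvm.nd.array(np.zeros((128, 1, 256), "float32"), dev),                # scan_c output
   ]
   timer = mod.time_evaluator(mod.entry_name, dev, number=10, repeat=10)
   print("TOPI mean (ms):", timer(*args).mean * 1e3)
   ```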




[GitHub] [tvm] tkonolige commented on pull request #11531: [TOPI] TE implementation of LSTM using scan

Posted by GitBox <gi...@apache.org>.
tkonolige commented on PR #11531:
URL: https://github.com/apache/tvm/pull/11531#issuecomment-1144236619

   This is a great first set of steps towards improving LSTM performance. Could you comment on what this unscheduled performance looks like vs what we currently have in TVM?



