Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/04/28 00:08:57 UTC

[GitHub] [tvm] tkonolige opened a new pull request #7935: [SPARSE] Improve sparse performance on ROCM

tkonolige opened a new pull request #7935:
URL: https://github.com/apache/tvm/pull/7935


   The current sparse dense GPU kernel uses warp-level storage to handle caching of data. Warp-level storage relies on shuffle intrinsics, which are slow on ROCm because they actually read and write shared memory. ROCm does provide intrinsics that do the right memory management, but they are not exposed through TVM. Instead, this PR switches to using shared memory on ROCm devices, which is about 2x faster.
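   As a rough sketch of the change (illustrative only, not the PR's exact code; `alloc_cache` and `use_warp_storage` are made-up names), the cache allocation inside the IR builder ends up choosing between the "warp" and "shared" storage scopes:

```python
from tvm.tir import ir_builder

def alloc_cache(ib, dtype, warp_size, use_warp_storage):
    # "warp" scope lowers to shuffle intrinsics (fast on CUDA); "shared"
    # keeps the cached values in shared memory / LDS, which this PR found
    # to be roughly 2x faster on ROCm.
    scope = "warp" if use_warp_storage else "shared"
    return ib.allocate(dtype, (warp_size,), name="data_cache", scope=scope)

ib = ir_builder.create()
cache = alloc_cache(ib, "float32", 64, use_warp_storage=False)  # e.g. on ROCm
```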
   
   @tmoreau89 @jwfromm 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] tkonolige commented on a change in pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
tkonolige commented on a change in pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#discussion_r622580205



##########
File path: python/tvm/topi/cuda/sparse.py
##########
@@ -170,6 +170,16 @@ def gen_ir(data, w_data, w_indices, w_indptr, out):
         # TODO(tkonolige): seperate implementation for large block sizes
         ib = tvm.tir.ir_builder.create()
 
+        if tvm.target.Target.current(allow_none=False).kind.name == "rocm":

Review comment:
       I've defaulted to not using warp storage unless we are on cuda.
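   A hedged paraphrase of the resulting condition, using the target API shown in the diff above (the surrounding gen_ir code, which runs inside a target scope, is omitted):

```python
import tvm

# Enable warp storage only when the current target is CUDA; everything else,
# including ROCm, falls back to the shared-memory path.
use_warp_storage = tvm.target.Target.current(allow_none=False).kind.name == "cuda"
```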







[GitHub] [tvm] masahi commented on a change in pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
masahi commented on a change in pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#discussion_r622475973



##########
File path: python/tvm/topi/cuda/sparse.py
##########
@@ -170,6 +170,16 @@ def gen_ir(data, w_data, w_indices, w_indptr, out):
         # TODO(tkonolige): seperate implementation for large block sizes
         ib = tvm.tir.ir_builder.create()
 
+        if tvm.target.Target.current(allow_none=False).kind.name == "rocm":

Review comment:
       I've never tested this kernel on Vulkan, but since our Vulkan and OpenCL targets do not have a default `thread_warp_size` specified, I'm pretty sure trying to use warp instructions there wouldn't work. Moreover, even if we had a default `thread_warp_size` for Vulkan, for example, lowering warp instructions needs dedicated codegen support, which does not exist for Vulkan and OpenCL.
   
   So for now, I think `use_warp_storage` should be True only for CUDA.
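   A sketch of the kind of guard this implies (hypothetical helper, not code from the PR; it assumes the target's `attrs` map carries `thread_warp_size` when the backend defines one):

```python
import tvm

def warp_storage_supported(target):
    # Warp storage needs both a known warp size and backend support for
    # lowering warp memory; at the time of this discussion only CUDA
    # satisfied both (Vulkan/OpenCL had neither a default thread_warp_size
    # nor the codegen support).
    warp_size = int(target.attrs.get("thread_warp_size", 1))
    return target.kind.name == "cuda" and warp_size > 1

# e.g. warp_storage_supported(tvm.target.Target("cuda"))
```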
   
   







[GitHub] [tvm] masahi edited a comment on pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828042725


   This post says: "They (`ds_permute` and `ds_bpermute` instructions) use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location". I don't know what they mean by "route without actually writing".
   https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/
   
   I wonder: if both approaches use shared memory, why is the explicit approach in this PR faster?
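   For what it's worth, a toy model of the "pull" semantics that post describes for `ds_bpermute` (lane indices are used directly here; the real instruction takes byte addresses and always operates on a full 64-lane wavefront):

```python
WAVEFRONT = 64

def ds_bpermute(addr, src):
    # Backward ("pull") permute: every lane i reads the value held by the
    # lane it addresses. The routing happens on LDS hardware, but per the
    # post no LDS location is actually written.
    return [src[addr[i] % WAVEFRONT] for i in range(WAVEFRONT)]

# Broadcasting lane 0's value to the whole wavefront:
# ds_bpermute([0] * WAVEFRONT, list(range(WAVEFRONT)))  ->  [0, 0, ..., 0]
```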





[GitHub] [tvm] t-vi commented on pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
t-vi commented on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828778315


   I don't think the descriptions are entirely accurate, but the Vega ISA manual says
   
   > This does not access LDS memory and may be called even if no LDS memory is allocated to the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront. 
   
   so I would expect that the performance lies somewhere between using LDS and registers. I can imagine that doing a lot less writing might save time in this specific case, but it probably is best to check with AMD before drawing global conclusions.
   








[GitHub] [tvm] masahi commented on pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828718501


   I'm planning to work on improving our GPU scan kernel using warp shuffle instructions, so I want to investigate this issue when I get there. Warp shuffle on AMD being slower than shared memory sounds surprising and counterintuitive. In the PR that introduced warp shuffle support to TVM's ROCm backend, https://github.com/apache/tvm/pull/5727, @t-vi mentioned that he got a good speedup on softmax reduction (https://github.com/apache/tvm/pull/5727#issuecomment-639109441). So I was under the impression that warp shuffle is generally a good thing on AMD too.





[GitHub] [tvm] tkonolige commented on pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
tkonolige commented on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828554093


   @masahi With `ds_permute`, we do a write to and a read from LDS for each of the 64 element accesses, versus a single write to LDS followed by 64 reads with the approach in this PR.
   
   Lower down, the post says: "All active lanes write data to a temporary buffer. All active lanes read data from the temporary buffer...".
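   Reading that literally, the per-lane LDS traffic works out roughly as follows (purely a back-of-the-envelope illustration of the point above):

```python
ACCESSES = 64  # one cached element broadcast per access

shuffle_ops = ACCESSES * (1 + 1)   # each shuffle: a write plus a read through LDS hardware
explicit_ops = 1 + ACCESSES        # one store to shared memory, then 64 reads

print(shuffle_ops, explicit_ops)   # 128 vs. 65
```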





[GitHub] [tvm] masahi merged pull request #7935: [SPARSE] Improve sparse performance on ROCM

Posted by GitBox <gi...@apache.org>.
masahi merged pull request #7935:
URL: https://github.com/apache/tvm/pull/7935


   

