Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/04/28 00:09:24 UTC

[GitHub] [tvm] masahi commented on pull request #7935: [SPARSE] Improve sparse performance on ROCM

masahi commented on pull request #7935:
URL: https://github.com/apache/tvm/pull/7935#issuecomment-828042725


   This post says: "They (`ds_permute` and `ds_bpermute` instructions) use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location"
   https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/
   
   If both approaches go through the shared memory (LDS) hardware, I wonder why the explicit approach in this PR is faster.
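
To make the comparison concrete, here is a minimal Python model of the two data-movement patterns being discussed. It is only a functional sketch, not the code in this PR: the function names mirror the instructions described in the linked post, the byte-address convention (lane id times 4) follows that post, and `shared_memory_exchange` is a hypothetical stand-in for an explicit write to LDS followed by a read. Collision handling in `ds_permute` is simplified and is an assumption of this model.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront, as in the quoted post


def ds_bpermute(addr, data):
    """Backward permute ("pull"): lane i reads the value held by lane addr[i] // 4."""
    n = len(data)
    return [data[(addr[i] // 4) % n] for i in range(n)]


def ds_permute(addr, data):
    """Forward permute ("push"): lane i sends its value to lane addr[i] // 4.

    Assumption of this model: lanes that receive nothing get 0, and on a
    collision the last writer in lane order wins.
    """
    n = len(data)
    out = [0] * n
    for i in range(n):
        out[(addr[i] // 4) % n] = data[i]
    return out


def shared_memory_exchange(src_lane, data):
    """Hypothetical explicit variant: write to an LDS buffer, sync, then read."""
    lds = list(data)  # every lane stores its value at its own slot
    # (a real kernel would need a barrier here before reading)
    return [lds[src_lane[i]] for i in range(len(data))]


if __name__ == "__main__":
    data = list(range(100, 100 + WAVEFRONT_SIZE))
    src = [(i + 1) % WAVEFRONT_SIZE for i in range(WAVEFRONT_SIZE)]
    addr = [s * 4 for s in src]  # byte addresses, 4 bytes per lane
    # Both patterns produce the same values for a rotate-by-one exchange.
    print(ds_bpermute(addr, data) == shared_memory_exchange(src, data))
```

The model only shows that both patterns can express the same exchange. It says nothing about which is faster on real hardware, which is the open question above.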


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org