Posted to commits@tvm.apache.org by "MasterJH5574 (via GitHub)" <gi...@apache.org> on 2024/03/09 15:02:17 UTC

[PR] [Runtime] PagedKVCache execute data copy on a separate stream [tvm]

MasterJH5574 opened a new pull request, #16692:
URL: https://github.com/apache/tvm/pull/16692

   This PR enhances PagedKVCache by moving data copies onto a separate stream. Specifically, for the CUDA and ROCm backends, we create a standalone copy stream for copying the auxiliary data structures from CPU to GPU. We also move the copy from BeginForward to Attention, so it is no longer executed eagerly; instead, it is executed lazily, when the Attention computation needs it.
   
   With these changes, the auxiliary data copy (on the copy stream) can overlap with the model forward computation that happens before the first Attention. As a result, some of the copy latency is hidden.
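
The lazy-copy behaviour described above can be illustrated with a small host-side toy model. This is only a sketch under stated assumptions, not the TVM implementation: the class `LazyAuxCopy`, its methods, and the helper `run_forward` are hypothetical names, no real streams are involved, and the "device" copy is just a Python list. It shows the dirty-flag pattern only: BeginForward records the auxiliary data without copying, and the first Attention call issues the copy once.

```python
class LazyAuxCopy:
    """Toy model of a dirty-flag lazy copy (hypothetical, not the TVM API)."""

    def __init__(self):
        self.host_aux = None    # auxiliary data prepared on the "CPU" side
        self.device_aux = None  # stand-in for the GPU-side copy
        self.dirty = False      # True while host_aux has not been copied yet
        self.num_copies = 0     # number of copies actually issued

    def begin_forward(self, aux):
        # Only record the data and mark it stale; no copy is issued here.
        self.host_aux = list(aux)
        self.dirty = True

    def _sync_aux(self):
        # Assumption: in a real runtime, this is roughly where an async copy
        # would be enqueued on the copy stream and the compute stream would
        # be made to wait for it. Here it is a plain list copy.
        self.device_aux = list(self.host_aux)
        self.num_copies += 1
        self.dirty = False

    def attention(self, layer_id):
        # The first attention call after begin_forward triggers the copy;
        # later layers in the same forward pass reuse the copied data.
        if self.dirty:
            self._sync_aux()
        return (layer_id, tuple(self.device_aux))


def run_forward(cache, aux, num_layers):
    """Run one simulated forward pass; report copies issued before attention."""
    cache.begin_forward(aux)
    copies_before_attention = cache.num_copies
    outputs = [cache.attention(i) for i in range(num_layers)]
    return copies_before_attention, outputs
```

In this model, a forward pass over several layers issues exactly one copy, and it is issued at the first `attention` call rather than in `begin_forward`. The latency hiding itself comes from the GPU streams running concurrently, which this host-only sketch does not attempt to reproduce.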
   
   This PR also bumps the FlashInfer version to pick up its copy stream support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] [Runtime] PagedKVCache execute data copy on a separate stream [tvm]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.

tqchen merged PR #16692:
URL: https://github.com/apache/tvm/pull/16692

