Posted to commits@tvm.apache.org by "abhikran-quic (via GitHub)" <gi...@apache.org> on 2024/03/21 14:05:13 UTC

[PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

abhikran-quic opened a new pull request, #16762:
URL: https://github.com/apache/tvm/pull/16762

   - This is needed because the Hexagon DMA engine expects applications to perform cache maintenance.
   - This change ensures correctness when bypass_cache mode is enabled.
   




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2015587174

   Ah, I see: it is the vm.builtin variant and not the normal dma_wait in TIR loops. That makes sense, since we don't normally do slice access at the Relax level. Thank you @abhikran-quic




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2012645578

   @abhikran-quic just so that I understand: this PR adds cache flush and invalidate, but are these necessary for every DMA operation? That likely has performance implications.

   For the normal case where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.

   So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (they are always coherent from the NPU's point of view), and we only do cache flush/invalidation during CPU-to-NPU or NPU-to-CPU transitions.
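
   A minimal sketch of that placement, assuming a purely illustrative pipeline and hypothetical flush_cache/invalidate_cache helpers (none of these names are part of the TVM or Hexagon API):

   ```
   # Hypothetical pipeline: cpuOpA => npuOpB => npuOpC => npuOpD => cpuOpE.
   # Cache maintenance happens only at the CPU/NPU boundaries; the chain of
   # NPU ops in the middle is already coherent from the NPU's point of view.
   def run_pipeline(x):
       a = cpu_op_a(x)        # CPU writes its result to DRAM through the cache
       flush_cache(a)         # flush once so the NPU sees the result in DRAM
       b = npu_op_b(a)        # no cache maintenance between consecutive NPU ops
       c = npu_op_c(b)
       d = npu_op_d(c)
       invalidate_cache(d)    # invalidate once so the CPU does not read stale lines
       return cpu_op_e(d)
   ```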




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "abhikran-quic (via GitHub)" <gi...@apache.org>.
abhikran-quic commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2016378863

   Thank you @tqchen!




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2015583884

   Got it, one further question: is there a case where we choose to bypass and not explicitly flush/invalidate at all? Or would it be helpful to have a single explicit memory barrier?

   For example, if we have multiple DRAM => DMA => VTCM copies from slices of the same buffer in a kernel, would it be more efficient to do the cache invalidate once for the entire buffer and then always bypass without invalidation?
   
   ```
   invalidate_cache(buffer)              # invalidate once for the whole buffer
   for i in range(num_slices):           # num_slices: illustrative slice count
       dma_copy(buffer[i*256 : i*256+256], dst_vtcm, bypass=True)
   ```
   
   
   
   
   




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "abhikran-quic (via GitHub)" <gi...@apache.org>.
abhikran-quic commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2015535087

   > @abhikran-quic thank you! Can you give an example of the intended use case? Just so we can understand more about the background context. The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause sub-optimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?
   
   Hi @tqchen,

   To give you some more background: the goal of the DMA builtins (`dma_copy` and `dma_wait`) for Hexagon is to replace synchronous (blocking) copy operations, which can cause stalls at runtime, with asynchronous copies. DMA copies can run in parallel with operators executing on the Hexagon vector/scalar core. An example is shown below:
   
   ```
   @R.function
   def main(input: R.Tensor((...), dtype="uint8"), weight: R.Tensor((...), dtype="uint8")) -> R.Tensor((...), dtype="uint8"):
       R.func_attr({"operator_name": "main"})
       cls = Module
       with R.dataflow():
           lv0 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_copy", (weight,), mem_scope="global.vtcm")
           lv1 = R.call_tir(bias_add, (input,), out_sinfo=R.Tensor((...), dtype="uint8"))
           lv2 = R.call_tir(relu, (lv1,), out_sinfo=R.Tensor((...), dtype="uint8"))
           lv3 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_wait", (lv0,), mem_scope="global.vtcm")
           gv = R.call_tir(conv, (lv2, lv3), out_sinfo=R.Tensor((...), dtype="uint8"))
           R.output(gv)
       return gv
   ```
   
   In the IR above, the `dma_copy` operation is the first op in the graph and instructs the DMA engine to copy the weights from DDR to VTCM. While the async copy happens, the `bias_add` and `relu` ops can execute on the HVX/scalar Hexagon core. A `dma_wait` operation is introduced before the `conv` operation to ensure that the DMA engine has finished copying the weights and the data is available in VTCM before `conv` proceeds.
   
   In the present PR, we intend to use the `bypass_cache` mode supported by the DMA engine when copying data to VTCM, which is expected to be faster than going through the cache.

   On your question of whether cache flush/invalidation can cause a performance degradation: in theory this is not expected, and in our experiments we observed roughly a 7-10% performance improvement with `bypass_cache` enabled.




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "quic-sanirudh (via GitHub)" <gi...@apache.org>.
quic-sanirudh merged PR #16762:
URL: https://github.com/apache/tvm/pull/16762




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2015175518

   @abhikran-quic thank you! Can you give an example of the intended use case? Just so we can understand more about the background context. The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause sub-optimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?
   
   




Re: [PR] [VM][Hexagon] Cache operations when bypass mode is enabled [tvm]

Posted by "abhikran-quic (via GitHub)" <gi...@apache.org>.
abhikran-quic commented on PR #16762:
URL: https://github.com/apache/tvm/pull/16762#issuecomment-2014270845

   > @abhikran-quic just so that I understand: this PR adds cache flush and invalidate, but are these necessary for every DMA operation? That likely has performance implications.
   > 
   > For the normal case where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.
   > 
   > So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (they are always coherent from the NPU's point of view), and we only do cache flush/invalidation during CPU-to-NPU or NPU-to-CPU transitions.
   > 
   > Say our ops are
   > 
   > ```
   > cpuopA => npuOpB => npuOpC => npuOpD => cpuopE
   > ```
   > 
   > We only do a cache flush after `cpuopA` (so the NPU can see the result in DRAM), and a cache invalidate after `npuOpD`.
   
   Hi @tqchen,
   Yes, you are right. For normal CPU <-> NPU communication, cache operations are needed when the CPU would like to read the final output of a model (for example, the output of `npuOpD`).
   
   In the case of Hexagon, there is a dedicated DMA engine that allows async copies of data into TCM. The DMA engine supports a mode where the cache (L1/L2) can be bypassed and data is copied directly from DDR to TCM. In that scenario, the hardware engine expects the application to manage cache operations; otherwise stale data gets picked up, leading to incorrect results. Hence, this change is introduced specifically for Hexagon and might not be applicable to normal CPU <-> NPU communication.
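
   To illustrate what "the application manages cache operations" means here, a minimal sketch under stated assumptions: flush_cache, invalidate_cache, dma_copy_bypass, and dma_wait below are hypothetical helper names used for illustration, not the actual Hexagon or TVM runtime API.

   ```
   # DDR -> VTCM with cache bypass: flush dirty source lines first, so the DMA
   # engine, which reads DDR directly, does not pick up stale data.
   flush_cache(weights_ddr)
   dma_copy_bypass(weights_vtcm, weights_ddr)   # async copy, bypassing L1/L2
   dma_wait()                                   # wait until the data is in VTCM

   # VTCM -> DDR with cache bypass: invalidate the destination's cached lines, so
   # the CPU later reads the fresh data from DDR rather than stale cached lines.
   invalidate_cache(output_ddr)
   dma_copy_bypass(output_ddr, output_vtcm)
   dma_wait()
   ```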


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org