Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/09/15 19:55:28 UTC

[GitHub] [tvm] mbs-octoml opened a new issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

mbs-octoml opened a new issue #9022:
URL: https://github.com/apache/tvm/issues/9022


   https://github.com/apache/tvm/blob/2aebd3335d89bb32d330b0f851ddaf2d551fc56e/src/tir/transforms/lower_tvm_builtin.cc#L115
   
   This particular code has a complex history, but the upshot is that we need finer-grained control for Allocate statements that is not gated by device_type, and the storage_scope in ProducerStore stmts fits the bill.
   
   This is a placeholder for figuring that out since I'm unfamiliar with storage handling once we enter TIR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526


   Please allow me to explain the overall rationale here, in particular around the term "constraint".
   
   - C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases so runtime libraries can be built to leverage the generic property (e.g. access from an NPU).
   - C1: On the other hand, from the compiler's POV, we want to leave flexibility to the code optimization and codegen phases, and only constrain the properties we care about (e.g. accessible from the CPU).
   
   ## The two ways to see the constraint
   
   The way C0 sees the constraint is about the possible accessors of the memory:
   - "global"=> memory can be accessed from {cpu, npu, other devices}
   - "global.stack" => memory can be accessed from { cpu }
   
   The way C1 sees the constraint is about possible memories to choose from.
   - "global"(memory that is accessible from CPU) => can choose from {stack, TVMBAW, other allocators}
   - "global.workspace" => can choose from  { TVMBAW }
   - "global.stack" => can choose from  { stack }
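The C1 view can be sketched as plain set containment; the following is a toy illustration only (scope names taken from the lists above, the allocator sets are purely illustrative and not TVM's actual model):

```python
# Illustrative sketch of the C1 view: each storage scope is modeled as
# the set of allocator choices it leaves open. A tighter scope permits
# a strict subset of a looser scope's choices, so fewer rewrites apply.
SCOPE_CHOICES = {
    "global": {"stack", "TVMBAW", "other"},   # any CPU-accessible memory
    "global.workspace": {"TVMBAW"},           # workspace allocator only
    "global.stack": {"stack"},                # stack allocation only
}

def is_more_constrained(a: str, b: str) -> bool:
    """True if scope `a` permits a strict subset of scope `b`'s choices."""
    return SCOPE_CHOICES[a] < SCOPE_CHOICES[b]
```

For example, `is_more_constrained("global.stack", "global")` holds, matching the intuition that a tagged scope leaves the compiler less room to rewrite.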
   
   ## Discussions
   
   When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is less flexibility for rewriting. For example, the `volatile` keyword puts additional constraints on memory access.
   
   This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform and is not well defined in a generic IR.
   The more constraints we put on the memory itself, the smaller the set of choices becomes. As a result, there are fewer opportunities
   for code transformations and optimizations.
   
   Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.
   
   - So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on CPU.
   - It also does not preclude TVMLowerBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.
   
   One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare that constraint through the IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.
   
   
   TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic that maximizes the benefit of stack allocation without blowing up the stack. Of course a better heuristic could be used; setting the default to 0 would hurt the performance of a lot of CPU code, so that is not desirable. We could certainly have a target-dependent property for micro targets and set it to 0 there. It should be pretty easy to do as well; see https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_warp_memory.cc#L392, which obtains a target-dependent warp size.
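A minimal sketch of what a target-dependent cutoff could look like (the target names, the lookup table, and the function are hypothetical; in real TVM this would be a TargetKind attribute queried from the C++ pass):

```python
# Hypothetical sketch: make the stack-alloca cutoff a per-target property
# instead of the fixed kMaxStackAllocaSize constant.
DEFAULT_MAX_STACK_ALLOCA = 1024  # bytes, mirroring the current default

# Assumed table; a micro-style target sets the cutoff to 0 so every
# allocation is routed to TVMBackendAllocWorkspace (TVMBAW).
TARGET_MAX_STACK_ALLOCA = {
    "llvm": DEFAULT_MAX_STACK_ALLOCA,
    "c": DEFAULT_MAX_STACK_ALLOCA,
    "c-micro": 0,
}

def lower_allocate(target: str, size_bytes: int) -> str:
    """Pick an allocation strategy for a 'global'-scope Allocate on CPU."""
    limit = TARGET_MAX_STACK_ALLOCA.get(target, DEFAULT_MAX_STACK_ALLOCA)
    return "stack" if 0 < size_bytes <= limit else "TVMBackendAllocWorkspace"
```

With the micro-style entry set to 0, every allocation falls through to TVMBAW, while CPU targets keep the stack fast path for small allocations.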
   
   
   
   
   






[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592


   Thanks for the discussions. Before we suggest a resolution, it would be helpful
   to summarize the discussions so far. 
   
   ## Semantics of Allocate and storage_scope
   
   Allocate at the moment does not specify the way to allocate the memory.
   The lowering phase is free to pick a runtime-dependent one or a
   more platform-native one (e.g. stack alloca or direct declaration of shared memory).
   
   In practice, we see two different kinds of needs:
   - N0: The need for optimizing "device" level code.
   - N1: The need to be able to allocate memory from the host side, possibly
     in a runtime dependent way.
   
   N0 is needed to get the best-performing kernel, since a native way of allocation
   that the compiler can understand gives the best chance for follow-up
   optimizations. This is the case for CPU-related optimizations. Note that such
   optimization is also needed in the micro setting, when we use TIR to generate kernel
   code that requests temporary scratch memory for output tiling.
   
   N1 is needed for "driving" the set of computations. In the case of GPU, we have
   to rely on a host-side driver function. In the case of micro AOT, it is desirable
   to allocate using platform-specific memory.
   
   One possible approach is to ask the user to differentiate these two kinds of
   allocations. However, to give users maximum convenience, TVM allows a mixed host/device
   programming model like CUDA and hides that complexity from the user. The boundary
   between N0 and N1 can also be blurred, e.g. a kernel can request allocation from
   platform-dependent memory.
   
   On specialized target devices, we also see an additional need:
   
   - N2: Allocating memory with special semantics. 
   
   For example, a unified device pinned memory that is accessible from both the NPU and the CPU,
   or a specialized texture memory or shared memory. The request in N2 is quite different
   and brings additional requirements on how we allocate the memory and how the memory can be
   used and represented in codegen.
   
   Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".
   This usually lets TVM know that the memory needs to be lowered by special
   target-dependent passes, and to leave those allocations out of the generic lowering.
   
   ## Possible Resolutions
   
   Based on discussions so far, there are two suggested resolutions.
   
   - R0: Separate out a "local" scope that carries the stack-allocation behavior.
   - R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace"
     that represents a global memory request specifically for workspace memory,
     along with lowerings for it. For specialized memory (e.g. NPU unified), introduce
     separate memory scopes.
   - R3: Introduce a target-specific attribute that marks the possible stack alloca size for
     lowering the R0 ("local") or R1 ("global") scopes. Note R3 can be done in addition to R0 or R1.
   
    





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920880447


   I feel that is a workaround for an optimization that a certain set of CPUs requires.
   
   Moreover, TVMBAW provides an abstraction to use an arena that could be placed in any memory for non-micro cases. Therefore it feels more like the general solution we want, unless a specific target/option requires placement on the stack.
   
   Landing the above change, which provides an optimization for certain CPUs in certain cases while creating functional problems for other backends, does not seem like the right thing to do. I'd rather revert the change for now and land the correct one that works for the general case.





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921152468


   Thanks @tqchen .
   
   In R4, I was not suggesting changing TOPI, but saying we could just use a pass to change the storage scope. What are your thoughts about having that pass set the storage_scope to 'global.stack'? This should be backward compatible, as it just performs an explicit specialization transparently in the IR. We could even do the max alloca size check in this particular pass.
   
   This makes it much clearer and more specific. Moreover, global.stack will not require further specialization, unlike global.workspace, where we would want to choose between multiple workspace buffers. The reason is that I'd like to keep the tag to distinguish between global memories that the compute devices have access to.
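Such a specialization pass could be sketched roughly as follows (toy Allocate node, not TVM's actual pass API; the 1024-byte cutoff stands in for kMaxStackAllocaSize):

```python
from dataclasses import dataclass, replace

# Toy stand-in for a TIR Allocate node; a real TVM pass would be a C++
# StmtMutator running over the module.
@dataclass(frozen=True)
class Allocate:
    storage_scope: str
    size_bytes: int

MAX_ALLOCA = 1024  # hypothetical per-allocation stack cutoff

def specialize_scope(alloc: Allocate) -> Allocate:
    """Rewrite 'global' to 'global.stack' explicitly in the IR when the
    allocation is small enough, instead of deciding late in codegen."""
    if alloc.storage_scope == "global" and alloc.size_bytes <= MAX_ALLOCA:
        return replace(alloc, storage_scope="global.stack")
    return alloc
```

The point of the sketch is the ordering: the scope decision is made transparently in the IR, so later stages (and codegen) merely honor "global.stack" rather than re-deriving the heuristic.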
   
   
   
   





[GitHub] [tvm] manupa-arm edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

manupa-arm edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921934672


   Thanks @tqchen for the explanation of two viewpoints of how we could see the constraints.
   
   I do agree that we should put more constraints progressively to guide the compiler in the lowering. 
   
   It is just that treating TVMBAW as a peer to the heap and stack seemed not right. In fact, one could use TVMBAW to perform heap allocation. My concern was that we are taking a decision far down in the compilation flow that could have been taken transparently in the IR itself a bit higher up.
   
   I think we are moving there with scopes, but it would have been nicer to stick with TVMBAW for now, as it is the more general lowering for 'global', and I would not necessarily see it as an orthogonal choice to the list you have mentioned. It just boils down to the fact that we want allocations on the stack for mem2reg optimizations. While I don't disagree with the logic of the argument, I wish it were more explicit higher up in the compilation flow. If it were not for mem2reg optimizations, one could simply provide an arena that serves the required workspace just as the stack would -- so it seemed to me like a better interim solution until we specialize the scope in the lowering.
   
   Yes, as you suggested, the implementation of the target-dependent query for the max alloca size is not particularly challenging; it is the API we provide to the user that we were worried about. This is important especially since "micro" is not really a target in TVM -- so the short-term solution seems to be that we would need one of the following TargetKind attributes for the C and LLVM backends:
   
   A1 :  --max-stack-alloca-size
   A2 :  --max-stack-size
   A3 :  --use-backend-alloc-workspace
   
   Ideally, from a UX point of view, it would be better to give the user the A2 argument; constraining the size of a single allocation seems like a proxy for controlling the behaviour of mem2reg transformations. However, A2 does not match the behaviour controlled by kMaxStackAllocaSize.
   
   A3, on the other hand, would set kMaxStackAllocaSize to zero, forcing all allocates to be serviced by TVMBAW, which could be from the heap or from an arena placed anywhere, as controlled by the runtime/application.





[GitHub] [tvm] manupa-arm edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

manupa-arm edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921782373


   @tqchen ,
   
   > To say it in another way, we cannot say that "global" definitely mean no stack allocation.
   
   The current issue is that with device "CPU", 'global' certainly means stack allocation whenever the size is below the heuristic threshold, not the other way around.
   
    > If the code needs to impose additional constraint that the memory must be accessible from a separate device(e.g. NPU), it certainly would require a more specialized constraint that is better spelled out explicitly.
   
   > As we can see that this is another kind of flexibility we want to enable here -- flexibility of picking possible backend allocation implementations without over constraining the code generator to a backend specific behavior that is platform dependent (like the case of pinned memory
   
   Yes, this is something we want eventually, and we will be working towards achieving it with the USMP work.
   
   Until we have that, the natural assumption in the absence of a 'constraint' should be that memories are more accessible rather than less accessible (e.g. stack). It's unfortunate that the current design prefers the latter, especially in the absence of a constraint.
   
   @mbs-octoml ,
   
   ### Short term solution :
   I think you are right; we might want to unblock this using a target-dependent kMaxStackAllocaSize.
   
   May I ask why the default was chosen to be this?
   https://github.com/apache/tvm/blob/1fd8f610953adc39cbd18d82f4a9e92a11575dfc/include/tvm/runtime/device_api.h#L60-L61
   
   It's interesting because the stack usage can grow beyond that size, since the check only looks at a single allocate at a time; i.e., you could have multiple allocates, each less than 1024. So the stack usage is not even bounded by the current approach.
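A toy calculation makes the unboundedness concrete (the sizes and count are hypothetical):

```python
# Each allocate individually passes the per-allocation check, yet the
# combined stack usage grows linearly with the number of allocates.
MAX_ALLOCA = 1024  # per-allocation cutoff, as in kMaxStackAllocaSize

allocates = [1000] * 50          # 50 allocations, each under the cutoff
on_stack = [s for s in allocates if s < MAX_ALLOCA]
total_stack = sum(on_stack)      # 50,000 bytes in total

assert all(s < MAX_ALLOCA for s in on_stack)
assert total_stack == 50_000     # far beyond the per-allocation bound
```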
   
   Therefore, to both unblock us with Ethos-U and also somewhat solve the problem that current micro builds use the stack for tensors < 1024 instead of the workspace buffer provided, maybe we should just make kMaxStackAllocaSize=0 (a binary decision rather than a value range).
   
   @Mousius @leandron @areusch , this means there is going to be another argument for a simple micro deployment, added to the already long list of arguments. Something like "--use-external-workspace"?
   
   @tqchen , I still feel it would have been super helpful if kMaxStackAllocaSize defaulted to zero, with the option of going higher based on a user argument, e.g. --max-stack-alloca-size=1024. It is not very convincing that we stack-allocate tensors on the prospect of mem2reg optimization without doing any marking (i.e. a special storage scope).
   
   
   





[GitHub] [tvm] mbs-octoml edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

mbs-octoml edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920485102


   Thanks so much for the context. I'll try to capture that in a comment.
   This is a 'bug' only in the sense that this heuristic is not working for the EthosU AOT codegen, I think because they are expecting to intercept the workpool ops downstream? But it does suggest a very simple way forward: make kMaxStackAlloca a TargetKind attribute so they can force it to zero.
   @manupa-arm can you chime in here? 





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921945722


   Thanks @manupa-arm. I agree that putting TVMBAW as a peer to the heap is not right (that was meant as an example to demonstrate the viewpoint; I do not necessarily want to enforce the heap as a peer to TVMBAW).
   
   For the case of AOT, however, I think it would still be useful to introduce a specialized scope eventually, so we can place the constraint that the memory is accessible by the NPU explicitly in the IR.
   
   While there is no micro-specific target, we can introduce micro-specific tags that set these properties. See https://github.com/apache/tvm/blob/main/src/target/tag.cc#L84 for an example of CUDA tags.
   
   I think all three attributes you proposed could work. A1 might be the simplest for now, and it won't preclude the A2/A3 options in the future.
    





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920841982


   @manupa-arm on CPU we do not necessarily differentiate local from global for now, as they come from the same namespace.
   
   I can understand the need from the micro side, and I believe this can be resolved by making TVMLowerBuiltin target-dependent and querying the max stack alloca property from the target. We can set the value to zero for micro-related targets.






[GitHub] [tvm] mbs-octoml edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

mbs-octoml edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921207626


   plus @csullivan who also needs finer-grained scopes.
   Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be kept coherent with the Allocate scope?
   @manupa-arm if you wanted to unblock EthosU today, you could make the kMaxStackAlloca heuristic Target-defined. That would buy us time to tackle scopes properly.
   





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921220669


   Thanks @manupa-arm. I understand that proposal R4 can also work by having a pass convert "global" to something more specialized (essentially R1 and R4 are not that different, except for different choices of scope names).
   
   The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is.
   
   Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This means the actual implementation can be TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. being accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope.
   
   We can of course further constrain the setting, to say "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple workspace buffers.
   
   So from a semantics point of view, the pass can indeed choose to keep "global" or rewrite it to "global.stack" to ensure a stack allocation. But if the scope remains "global", we should not preclude the possibility of downstream allocating from the stack (the code generator should be able to choose any kind that satisfies the constraint). To say it another way, we cannot say that "global" definitely means no stack allocation.
   
   If the code needs to impose the additional constraint that the memory must be accessible from a separate device (e.g. NPU), it certainly requires a more specialized constraint that is better spelled out explicitly.
   
   As we can see, this is another kind of flexibility we want to enable here -- the flexibility of picking possible backend allocation implementations without over-constraining the code generator to backend-specific behavior that is platform dependent (like the case of pinned memory).





[GitHub] [tvm] areusch edited a comment on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

areusch edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923290583


   apologies for the delay in replying here.
   
   I agree that allocating on the stack is, in the current GraphExecutor world, a codegen-level optimization. In my mind, the introduction of AOTExecutor (and actually also VMExecutor) means that this is moving one level up. Here is why:
   - previously, the maximum effective call depth of a TIR function was 1, meaning that codegen could decide *then and there* that it made sense to move a particular local `tir.allocate` node onto the stack.
   - with the introduction of AOT and the control-flow VMExecutor, such a `tir.allocate` may now be live while another TIR function is being invoked, e.g. with `tir.call_packed`.
   
   Now that this is the case, it doesn't make sense to make stack-allocation of `tir.allocate` a codegen-level optimization. It must be done prior to codegen using graph-level memory analysis e.g. USMP.
   
   The question then is how to model this in the program. I think using `storage_scope` does make sense. The definition of a `storage_scope` I've been working with is:
   > A contiguous region of memory, potentially with a maximum size, which can be used by operator implementations on a set of `tvm.Device`.
   
   Under this definition, a special `<context>.stack` `storage_scope` could be created with a target-specific bound on the amount of memory available. I do not think we should create a default `global` scope anywhere, as it's unlikely a global scope is truly ever global except in the case of single-CPU homogeneous model execution. We just haven't modeled this yet in TVM, as we don't have an explicit mapping indicating which storage_scopes are accessible by which device. The closest thing to a global scope I can think of would be something like `executor` scope, which could be used to place `DLTensor` used by the AOT executor for intermediate tensors.
   
   Something we still have yet to address is how to configure the `storage_scope` made available for a particular Target. I think the Target-specific attributes may not quite cut it, as `llvm` obviously would have different amounts of stack RAM available based on the selected CPU. And, the user will need to be able to override this as well, since sometimes linking with different software libraries reduces the stack RAM available (e.g. some libraries require more global memory and therefore take away from that RAM available for stack usage; or other times, some libraries require massive interrupt stacks and a larger budget must be reserved to mitigate the risk of overwriting global memory due to an interrupt).
   
   It's likely that a scalable solution in micro-land is to have the Project API server able to provide a boilerplate set of `storage_scope` in an e.g. `server_info_query` call; and then it's likely we will need to have a way for users to override this.
   
   Lastly, as to the `<context>` prefix mentioned above: there are many cases when multiple stacks exist on a device:
   - multi-core execution
   - parallel thread execution (e.g. stack viewed as a sort of thread-local storage)
   
   in the former case, it's likely that `<context>` should be replaced with the `name` given to the TargetDevice e.g. in https://github.com/apache/tvm/pull/8892. For example `dsp.stack` or `always-on-cpu.stack`.
   
   in the latter case, we probably additionally need a thread identifier e.g. `dsp.thread0.stack`.
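Scope strings of this shape split naturally into a base and a tag at the first dot; a minimal sketch of that convention (names like `dsp.thread0.stack` are taken from the examples above, and the helper itself is hypothetical):

```python
def split_scope(scope: str) -> tuple[str, str]:
    """Split a storage scope string into base and optional tag,
    e.g. 'global.workspace' -> ('global', 'workspace')."""
    base, _, tag = scope.partition(".")  # split at the first '.'
    return base, tag
```

Under this convention a thread-qualified scope like `dsp.thread0.stack` keeps the whole qualifier in the tag, so a lowering pass can dispatch on the base and inspect the tag separately.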
   
   Until we have USMP, my thoughts are that the short-term solution should be to stick with a codegen-level optimization and add an attribute which can be used to disable the stack-allocation optimization. What do you think @tqchen @manupa-arm ?





[GitHub] [tvm] areusch edited a comment on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
areusch edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923290583


   apologies for the delay in reply here.
   
   I agree that allocating on the stack is, in the current GraphExecutor world, a codegen-level optimization. In my mind, the introduction of AOTExecutor (and actually also VMExecutor) means that this is moving one level up. Here is why:
   - previously, the maximum effective call depth of a TIR function was 1, meaning that codegen could decide *then and there* that it made sense to move a particular local `tir.allocate` node onto the stack.
   - with the introduction of AOT and control flow VMExecutor, such `tir.allocate` may now live while another TIR function is being invoked e.g. with `tir.call_packed`.
   
   Now that this is the case, it doesn't make sense to make stack-allocation of `tir.allocate` a codegen-level optimization. It must be done prior to codegen using graph-level memory analysis e.g. USMP.
   
   The question then is how to model this in the program. I think using `storage_scope` does make sense. The definition I've been working with for a `storage_scope` is:
   > A contiguous region of memory, potentially with a maximum size, which can be used by operator implementations on a set of `tvm.Device`.
   
   Under this definition, a special `<context>.stack` `storage_scope` could be created with a target-specific bound on the amount of memory available. I do not think we should create a default `global` scope anywhere, as it's unlikely a global scope is truly ever global except in the case of single-CPU homogeneous model execution. We just haven't modeled this yet in TVM, as we don't have an explicit mapping indicating which storage_scopes are accessible by which device. The closest thing to a global scope I can think of would be something like `executor` scope, which could be used to place `DLTensor`s used by the AOT executor for intermediate tensors.
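   To make the definition concrete, here is a minimal sketch in plain Python (the `StorageScope` class, its fields, and the scope names are all illustrative, not actual TVM API) of a scope as a bounded region accessible from a set of devices:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class StorageScope:
    """A contiguous region of memory, potentially with a maximum size,
    usable by operator implementations on a set of devices (hypothetical model)."""
    name: str                        # e.g. "dsp.stack" or "executor"
    max_size_bytes: Optional[int]    # None means unbounded
    accessible_from: FrozenSet[str]  # device names that can use this region

    def can_place(self, device: str, size_bytes: int) -> bool:
        # A buffer fits in this scope if the device can access it and the
        # scope's size bound (if any) is not exceeded.
        if device not in self.accessible_from:
            return False
        return self.max_size_bytes is None or size_bytes <= self.max_size_bytes

# Example: a per-device stack scope with a target-specific bound.
dsp_stack = StorageScope("dsp.stack", 16 * 1024, frozenset({"dsp"}))
print(dsp_stack.can_place("dsp", 4096))          # True: fits on the DSP stack
print(dsp_stack.can_place("always-on-cpu", 64))  # False: wrong device
```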
   
   Something we still have yet to address is how to configure the `storage_scope` made available for a particular Target. I think the Target-specific attributes may not quite cut it, as `llvm` obviously would have different amounts of stack RAM available based on the selected CPU. And, the user will need to be able to override this as well, since sometimes linking with different software libraries reduces the stack RAM available (e.g. some libraries require more global memory and therefore take away from that RAM available for stack usage; or other times, some libraries require massive interrupt stacks and a larger budget must be reserved to mitigate the risk of overwriting global memory due to an interrupt).
   
   It's likely that a scalable solution in micro-land is to have the Project API server able to provide a boilerplate memory layout in an e.g. `server_info_query` call; and then it's likely we will need to have a way for users to override this.
   
   Lastly, as to the `<context>` prefix mentioned above: there are many cases when multiple stacks exist on a device:
   - multi-core execution
   - parallel thread execution (e.g. stack viewed as a sort of thread-local storage)
   
   in the former case, it's likely that `<context>` should be replaced with the `name` given to the TargetDevice e.g. in https://github.com/apache/tvm/pull/8892. For example `dsp.stack` or `always-on-cpu.stack`.
   
   in the latter case, we probably additionally need a thread identifier e.g. `dsp.thread0.stack`.
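   As a sketch of the naming above (the `<context>[.<thread>].stack` form is my reading of the examples, not an established TVM convention), such scope strings could be parsed like this:

```python
from typing import NamedTuple, Optional

class StackScope(NamedTuple):
    context: str           # TargetDevice name, e.g. "dsp" or "always-on-cpu"
    thread: Optional[str]  # thread identifier, e.g. "thread0", if present

def parse_stack_scope(scope: str) -> StackScope:
    """Parse illustrative scope names of the form
    '<context>.stack' or '<context>.<thread>.stack'."""
    parts = scope.split(".")
    if parts[-1] != "stack" or len(parts) not in (2, 3):
        raise ValueError(f"not a stack scope: {scope!r}")
    thread = parts[1] if len(parts) == 3 else None
    return StackScope(parts[0], thread)

print(parse_stack_scope("dsp.stack"))          # StackScope(context='dsp', thread=None)
print(parse_stack_scope("dsp.thread0.stack"))  # StackScope(context='dsp', thread='thread0')
```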
   
   Until we have USMP, my thoughts are that the short-term solution should be to stick with a codegen-level optimization and add an attribute which can be used to disable the stack-allocation optimization. What do you think @tqchen @manupa-arm ?





[GitHub] [tvm] masahi commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
masahi commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-1008440167


   Can we close this?





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921044001


   Thanks @tqchen for summarizing the ideas and presenting possible resolutions.
   
   The two needs seem very valid.
   
   For N0, the operators should really be tagged with the 'local' storage scope, as these allocations are quite local to the operator PrimFunc and benefit from further optimization within and beyond TVM -- i.e. by the follow-up C compiler / LLVM.
   
   For N1, we could use the 'global' tag to give the responsibility for the application/runtime layer to service the allocation.
   
   Therefore, the actual fix should have been tagging the allocates that are expected to be optimized as 'local' to the PrimFunc, rather than treating 'global' allocates on CPU as local.
   
   > N0 is needed to get the best-performing kernels, since a native way of allocation
   that can be understood by the compiler gives the best chance for follow-up
   optimizations. This is the case for CPU-related optimizations. Note that such
   optimization is also needed in the micro setting, when we use TIR to generate kernel
   code that requests temporary scratch memory for output tiling.
   
   I feel we are incorrectly tagging the storage_scopes here. They should really be 'local' for this specific use case.
   
   > First of all, R0 and R1 are not that different in nature. Both tries to introduce two separate scopes that brings different behavior. The main questions boils down to how can we name the "global" scope.
   
   In solution R1, I still see that as a workaround for the incorrect treatment of 'global'-scoped memories, where we create an override of an actual 'global' that we declare as 'global.workspace'. In shared-memory SoCs, it would be an unscalable explosion of tags if we kept tagging memories with the devices that have access to them. I would think we'd want to treat memories and devices as having a many-to-many relationship.
   
   The lowering we had until a few days ago was general enough that allocations were serviced by a runtime/application-layer routine, and that was more aligned with what we call 'global' (with respect to codegen) scoped storage.
   
   > Per allocate semantics, we treats "global" as normal CPU memory which can come from stack or platform specific allocation
   
   Can you explain what you define as 'normal CPU memory'? A CPU can technically have access to many memories.
   
   > However, the need N0 would favor stack allocation when possible. Note that we will likely need a related behavior for micro devices as well when generating operator kernels.
   
   It would have been nice to have an RFC (apologies in advance if I missed one) to discuss this before we moved away from TVMBAW-style allocation, which I find more generic than stack allocation alone. It almost feels like the schedules should have tagged these allocations 'local' if this was the expectation, rather than assuming the combined logic of 'global' and CPU.
   
   > One possible approach is to try to ask user to differentiate these two kinds of allocations.
   
   Wouldn't it be simpler to tag allocations for N0 as 'local' and those for N1 as 'global'?
   
   > N2: Allocating memory with special semantics. For example, an unified device pinned memory that is accessible from both NPU and CPU. A specialized texture memory or shared memory. The request in N2 is quite different and brings additional requirement to how we allocate the memory and how the memory can be used and represented in codegen.
   
   It's a memory that multiple Targets have access to, which the runtime/application could provide via TVMBAW with a specialized global.<pool_name>.
   
   > It is important to do this because the compiler makes no assumption that "global" can be accessed by other types of devices.
   
   Hmmmm, this argument seems counter-intuitive to me. I think we should assume 'global' to be accessible unless explicitly restricted, i.e. global.<pool_name>. Otherwise, the terminology is confusing.
   
   > 
       R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm )
       R1: Keep "global" scope as it is, introduce a special tagged scope "global.workspace"
       that represents a global memory request specifically for workspace memory.
       And introduce lowerings for them. For specialized memory(e.g. NPU unified), introduce
       separate memory scopes.
       R3: Introduce target specific attribute that marks the possible stack alloca size for
       lowering the R0("local") or R1("global"). Note R3 can be done in addition to R0 or R1.
   
   Ideally, the allocates destined to end up on the stack should have been 'local'. Moreover, at that point we can decide not to tag allocates that exceed the target-specific attribute, rather than dealing with this in codegen or the lower_tvm_builtin pass.
   
   I would propose :
   
   R4 : R0 + I think it's cleaner if we just introduce an optional pass to tag memories as 'local'. That pass should only tag them according to the target-specific max stack alloca size attribute -- which could work until we fix the schedules that want allocations on the stack to use the 'local' storage scope for need N0 -- a conscious decision the schedule writer / autoscheduler takes.
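   A rough sketch of what such an optional tagging pass could look like (the `Allocate` record and `tag_local_pass` function here are stand-ins for illustration, not real TIR nodes or an actual TVM pass):

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Allocate:
    """Stand-in for a TIR Allocate: just a size and a storage scope."""
    size_bytes: int
    scope: str = "global"

def tag_local_pass(allocs: List[Allocate], max_stack_alloca: int) -> List[Allocate]:
    # Retag 'global' allocates that fit the target's stack budget as 'local';
    # anything larger stays 'global' and is serviced elsewhere (e.g. TVMBAW).
    return [
        replace(a, scope="local")
        if a.scope == "global" and a.size_bytes <= max_stack_alloca
        else a
        for a in allocs
    ]

out = tag_local_pass([Allocate(128), Allocate(1 << 20)], max_stack_alloca=1024)
print([a.scope for a in out])  # ['local', 'global']
```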
   





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-924154985


   P1 : #9065 





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526


   Please allow me to explain the overall rationale here, in particular over the term "constraint"
   
   - C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases so runtime libraries can be built to leverage the generic property(e.g. access from NPU).
   - C1: On the other hand, from compiler's POV, we want to leave flexibility to code optimizations and codegen phase, and only constraint on the property we want (e.g. accessible from CPU)
   
   ## The two ways to see the constraining
   
   The way C0 sees the constraint is about the possible accessors of the memory:
   - "global" => memory can be accessed from {cpu, npu, other devices}
   - "global.stack" => memory can be accessed from { cpu }
   
   The way C1 sees the constraint is about possible memories to choose from.
   - "global"(memory that is accessible from CPU) => can choose from {stack, heap, TVMBAW}
   - "global.workspace" => can choose from  { TVMBAW }
   - "global.stack" => can choose from  { stack }
   
   ## Discussions
   
   When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is a lack of flexibility in terms of rewriting. For example, the `volatile` keyword puts additional constraints on memory access.
   
   This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not well defined in a generic IR.
   The more constraints we put on the memory itself, the smaller the set of candidate memories becomes. As a result, there are fewer opportunities
   for code transformations and optimizations.
   
   Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.
   
   - So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on CPU.
   - It also does not preclude TVMLowerBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.
   
   One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness(since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare such constraint through IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.
   
   TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic number that maximizes the benefit of stack allocation without overflowing the stack. Of course a better heuristic could be used; setting the default to 0 would bring down the performance of a lot of CPU code, so that is not desirable. We could certainly have a target-dependent property for micro targets and set it to 0.
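   A hedged sketch of that decision rule (hypothetical function and parameter names, not the actual `lower_tvm_builtin.cc` code):

```python
def lower_allocate(device_is_cpu: bool, scope: str, size_bytes: int,
                   max_stack_alloca: int) -> str:
    """Illustrative model of the lowering choice: a small 'global'
    allocation on CPU may go to the stack; everything else falls back
    to the runtime workspace allocator (TVMBackendAllocWorkspace)."""
    if device_is_cpu and scope == "global" and size_bytes <= max_stack_alloca:
        return "stack"   # native alloca: visible to follow-up optimizers
    return "TVMBAW"      # runtime-serviced workspace allocation

# A target-dependent attribute could force micro targets off the stack:
print(lower_allocate(True, "global", 256, max_stack_alloca=1024))  # stack
print(lower_allocate(True, "global", 256, max_stack_alloca=0))     # TVMBAW
```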
   
   Please let me know if it clarifies things
   
   
   
   





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526


   Please allow me to explain the overall rationale here, in particular over the term "constraint"
   
   - C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases so runtime libraries can be built to leverage the generic property(e.g. access from NPU).
   - C1: On the other hand, from compiler's POV, we want to leave flexibility to code optimizations and codegen phase, and only constraint on the property we want (e.g. accessible from CPU)
   
   ## The two ways to see the constraining
   
   The way C0 sees the constraint is about the possible accessors of the memory:
   - "global" => memory can be accessed from {cpu, npu, other devices}
   - "global.stack" => memory can be accessed from { cpu }
   
   The way C1 sees the constraint is about possible memories to choose from.
   - "global"(memory that is accessible from CPU) => can choose from {stack, heap, TVMBAW}
   - "global.workspace" => can choose from  { TVMBAW }
   - "global.stack" => can choose from  { stack }
   
   ## Discussions
   
   When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is a lack of flexibility in terms of rewriting. For example, the `volatile` keyword puts additional constraints on memory access.
   
   This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not well defined in a generic IR.
   The more constraints we put on the memory itself, the smaller the set of candidate memories becomes. As a result, there are fewer opportunities
   for code transformations and optimizations.
   
   Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.
   
   - So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on CPU.
   - It also does not preclude TVMLowerBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.
   
   One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness(since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare such constraint through IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.
   
   
   TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic number that maximizes the benefit of stack allocation without overflowing the stack. Of course a better heuristic could be used; setting the default to 0 would bring down the performance of a lot of CPU code, so that is not desirable. We could certainly have a target-dependent property for micro targets and set it to 0. It should be pretty easy to do as well; see https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_warp_memory.cc#L392, which obtains a target-dependent warp size.
   
   
   
   
   





[GitHub] [tvm] masahi commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
masahi commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921412813


   > Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?
   
   @mbs-octoml The PR https://github.com/apache/tvm/pull/8366 I did might answer your question. The storage scope attached to `ProducerRealizeNode` is used to create `PointerType` and `Buffer`. 





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592


   Thanks for the discussions. Before we suggest a resolution, it would be helpful
   to summarize the discussions so far. 
   
   ## Semantics of Allocate and storage_scope
   
   Allocate at the moment does not specify the way to allocate the memory.
   The lowering phase is free to pick a runtime-dependent one or a
   more platform-native one (e.g. stack alloca or direct declaration of shared memory).
   
   In practice, we see two different kinds of needs:
   - N0: The need for optimizing "device" level code.
   - N1: The need to be able to allocate memory from the host side, possibly
     in a runtime dependent way.
   
   N0 is needed to get the best-performing kernels, since a native way of allocation
   that can be understood by the compiler gives the best chance for follow-up
   optimizations. This is the case for CPU-related optimizations. Note that such
   optimization is also needed in the micro setting, when we use TIR to generate kernel
   code that requests temporary scratch memory for output tiling.
   
   N1 is needed for "driving" the set of computations. In the case of GPU, we have
   to rely on a host-side driver function. In the case of micro AOT, it is desirable
   to allocate using platform-specific memory.
   
   One possible approach is to ask the user to differentiate these two kinds of
   allocations. However, to give the user maximum convenience, TVM allows a mixed host/device
   programming model like CUDA and hides that complexity from the user. The boundary
   between N0 and N1 can also be blurred, e.g. a kernel can request allocation from
   platform-dependent memory.
   
   In specific specialized target devices, we also see an additional need:
   
   - N2: Allocating memory with special semantics. 
   
   For example, an unified device pinned memory that is accessible from both NPU and CPU. 
   A specialized texture memory or shared memory. The request in N2 is quite different 
   and brings additional requirement to how we allocate the memory and how the memory can be 
   used and represented in codegen.
   
   Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".
   This usually lets TVM know that the memory needs to be lowered by possibly
   special target-dependent passes, and leave those allocations out of the default lowering. It is important to do this because
   the compiler makes no assumption that "global" can be accessed by other types of devices.
   
   ## Possible Resolutions
   
   Based on discussions so far, there are two suggested resolutions.
   
   - R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm )
   - R1: Keep "global" scope as it is, introduce a special tagged scope "global.workspace"
     that represents a global memory request specifically for workspace memory. 
     And introduce lowerings for them. For specialized memory(e.g. NPU unified), introduce
     separate memory scopes. 
   - R3: Introduce a target-specific attribute that marks the possible stack alloca size for 
     lowering the R0 ("local") or R1 ("global") scope. Note R3 can be done in addition to R0 or R1.
   
    





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921134533


   Thanks @manupa-arm . Trying to capture some of the discussions.
   
   - Right now the "global" scope translates to something that can be accessed by the CPU, and there was no requirement to allocate via TVMBAW. This is consistent with the allocate node for other kinds of scopes as well, such as shared memory on a GPU. 
   -  I can understand the general proposal. R4's "global"-to-"local" change would introduce a semantic change to the "global" scope, and can cause backward-compatibility issues.
   
   The main reason I would suggest R1 is that it is a backward-compatible change (no changes to topi and other libraries needed). Additionally, it reduces the user's overall mental cost of thinking about and choosing another kind of memory.
   While I can totally get the other interpretation as well (defining "local" to mean stack allocation and "global" to mean definitely not stack), it is not the currently intended semantics.
   
   Note that introducing an explicit tag for TVMBAW via "global.workspace" won't necessarily mean we need to introduce tags for all kinds of memory. It does place constraints on how this kind of memory can be allocated, so things can work out of the box.
   
   





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592


   Thanks for the discussions. Before we suggest a resolution, it would be helpful
   to summarize the discussions so far. 
   
   ## Semantics of Allocate and storage_scope
   
   Allocate at the moment does not specify the way to allocate the memory.
   The lowering phase is free to pick a runtime-dependent one or a
   more platform-native one (e.g. stack alloca or direct declaration of shared memory).
   
   In practice, we see two different kinds of needs:
   - N0: The need for optimizing "device" level code.
   - N1: The need to be able to allocate memory from the host side, possibly
     in a runtime dependent way.
   
   N0 is needed to get the best-performing kernels, since a native way of allocation
   that can be understood by the compiler gives the best chance for follow-up
   optimizations. This is the case for CPU-related optimizations. Note that such
   optimization is also needed in the micro setting, when we use TIR to generate kernel
   code that requests temporary scratch memory for output tiling.
   
   N1 is needed for "driving" the set of computations. In the case of GPU, we have
   to rely on a host-side driver function. In the case of micro AOT, it is desirable
   to allocate using platform-specific memory.
   
   One possible approach is to ask the user to differentiate these two kinds of
   allocations. However, to give the user maximum convenience, TVM allows a mixed host/device
   programming model like CUDA and hides that complexity from the user. The boundary
   between N0 and N1 can also be blurred, e.g. a kernel can request allocation from
   platform-dependent memory.
   
   In specific specialized target devices, we also see an additional need:
   
   - N2: Allocating memory with special semantics. 
   
   For example, an unified device pinned memory that is accessible from both NPU and CPU. 
   A specialized texture memory or shared memory. The request in N2 is quite different 
   and brings additional requirement to how we allocate the memory and how the memory can be 
   used and represented in codegen.
   
   Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".
   This usually lets TVM know that the memory needs to be lowered by possibly
   special target-dependent passes, and leave those allocations out of the default lowering. It is important to do this because
   the compiler makes no assumption that "global" can be accessed by other types of devices.
   
   ## Possible Resolutions
   
   Based on discussions so far, there are two suggested resolutions.
   
   - R0: Separate out a "local" scope that carries the stack allocation behavior.
   - R1: Keep "global" scope as it is, introduce a special tagged scope "global.workspace"
     that represents a global memory request specifically for workspace memory. 
     And introduce lowerings for them. For specialized memory(e.g. NPU unified), introduce
     separate memory scopes. 
   - R3: Introduce a target-specific attribute that marks the possible stack alloca size for 
     lowering the R0 ("local") or R1 ("global") scope. Note R3 can be done in addition to R0 or R1.
   
    





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920896447


   So in the above post I tried to summarize the state. Now let me try to share some of my thoughts based on the summary.
   
   First of all, R0 and R1 are not that different in nature. Both try to introduce two separate scopes that bring different behavior. The main question boils down to how we name the "global" scope.
   
   Per allocate semantics, we treat "global" as normal CPU memory, which can come from the stack or a platform-specific allocation. The system can choose the best way of doing such lowering. Always lowering to TVMBAW is indeed more general for the need of N1. However, the need N0 would favor stack allocation when possible. Note that we will likely need a related behavior for micro devices as well when generating operator kernels.
   
   While it is OK to differentiate stack-allocated memory from a platform-specific one, doing so would place an additional burden on the user and would require a significant refactor of the operator implementations.
   
   The main requests so far come from need N1. In that case, it would be easy for the AOT generator to allocate memory with a special tag ("global.workspace") that enforces workspace allocation, since in this setting there is a single expected behavior.
   
   So my suggestion would be R1+R3, as it helps to resolve the need in a way that is compatible with the current semantics and use cases. It will also open the door for more future scope-dependent optimizations.





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920841982


   @manupa-arm on CPU we do not necessarily differentiate local from global for now, as they come from the same namespace. 
   
   I can understand the need from the micro side, and I believe this can be resolved by making TVMLowerBuiltin target-dependent and querying the max stack alloca property from the target. We can set the value to zero for micro-related targets.   
   





[GitHub] [tvm] mbs-octoml commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
mbs-octoml commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921207626


   + @csullivan, who also needs finer-grained scopes.
   Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?
   @manupa-arm if you wanted to unblock EthosU today you could make the kMaxStackAlloca heuristic Target-defined. That could buy us time to tackle scopes properly.
   





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921220669


   Thanks @manupa-arm . I understand that proposal R4 can also work by having a pass to convert "global" to something more specialized (essentially R1 and R4 are not that different, apart from the choice of scope names).
   
   The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is. 
   
   Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This includes TVMBAW memory, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. accessibility from other devices because it is pinned), those are not part of the constraint specified by the "global" scope.
   
   We can of course further constrain the setting, to say "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple buffers.
   
   So from a semantics point of view, the pass can indeed choose to return "global", or rewrite to "global.stack" to ensure a stack allocation. But if the scope remains "global", we should not preclude downstream from allocating on the stack (the code generator should be able to choose any kind of memory that satisfies the constraint). 
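   As a toy model of this scopes-as-constraints view: each scope names the set of allocation strategies a code generator may legally pick. The scope names come from the thread; the mapping itself is illustrative, not TVM's implementation:

```python
# Each scope is a constraint; a backend may pick any strategy satisfying it.
# "global.workspace" and "global.stack" are strict refinements of "global".
ALLOWED_STRATEGIES = {
    "global": {"stack", "TVMBAW", "pinned"},   # anything the host CPU can access
    "global.workspace": {"TVMBAW"},            # must come through the workspace API
    "global.stack": {"stack"},                 # must be a stack allocation
}

def satisfies(scope, strategy):
    """True if `strategy` is a legal way to fulfil an allocation in `scope`."""
    return strategy in ALLOWED_STRATEGIES[scope]
```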
   
   If the code needs to impose the additional constraint that the memory must be accessible from a separate device (e.g. an NPU), it certainly requires a more specialized constraint that is better spelled out explicitly. 
   
   As we can see, this is another kind of flexibility we want to enable here -- the flexibility of picking among possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform-dependent (as in the case of pinned memory).





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921134533


   Thanks @manupa-arm . Trying to capture some of the discussions.
   
   - Right now the "global" scope translates to something that can be accessed by the CPU, and there was no requirement to allocate via TVMBAW. This is consistent with Allocate nodes for other kinds of scopes as well, such as shared memory on a GPU. 
   - I can understand the general proposal R4: changing "global" to "local" would introduce a semantic change to the "global" scope, and can cause backward-compatibility issues.
   
   The main reason I suggest R1 is that it is a backward-compatible change (no changes to topi and other libraries needed). Additionally, it reduces the overall mental cost for users of thinking about and choosing another kind of memory.
   While I can totally see the other interpretation as well (defining "local" to mean stack allocation and "global" to definitely exclude it), that is not the current intended semantics.
   
   Note that introducing an explicit tag for TVMBAW via "global.workspace" won't necessarily mean we need to introduce tags for all kinds of memory. It does place constraints on how this kind of memory can be allocated, so things can work out of the box.
   
   





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923686813


   So there are two sub-issues here, if you follow the discussion: 
   
   P1: The AoT executor codegen creates tir.allocates that should more generally be served by TVMBAW calls.
   P2: CPU PrimFuncs generate stack allocations for some allocates based on a per-allocate heuristic.
   
   For P1,
   We are working on a PR (to be published soon) to make the AoT executor codegen create tir.allocates with the 'global.workspace' storage scope (as discussed above), to make sure they get lowered to TVMBAW calls, which are device/memory-agnostic from a codegen perspective.
   
   For P2,
   We might need to control the stack-alloca heuristic per compilation.
   
   I feel that should solve the issues in the scope.
   
   **@areusch shall we move the discussion to another issue/discuss post about promoting/introducing a storage_scope that is higher than the current definition of 'global'?**
   
   I agree with you that the current definition of 'global' is a misnomer. There are certain memories (e.g. the stack) that could be 'owned' by a TargetDevice -- in which case we could do <TargetDevice>.stack. In a way, the stack is special as it does not require an opaque call to get hold of the memory, so LLVM (or any other post-TVM compiler) can do optimizations. Hence my emphasis on treating it specially with the 'local' storage_scope. However, I think it would be too big of a change given the historical precedent in TVM, as @tqchen points out.
   
   However, in shared-memory SoCs, assigning ownership of memories to a TargetDevice will not scale very well. For example, if we have an SRAM that is accessed by the CPU, DSP, and NPU in the same SoC, what should the prefix be? The current approach we are taking in USMP (and discussed in the RFC) is to use the storage_scope global.<pool_name>, where pool_name points to a PoolInfo that describes which TargetDevices have access to it. Therefore, I believe we could make an Arena that is pinned to those memories in the runtime layer to serve each Pool via TVMBAWs. If USMP is invoked, the static-sized allocations will be translated to fixed offsets within a static buffer pinned to the same memories. I guess my point is that the provisioning of Arenas/buffers should be lifted out of the codegen into the application/runtime layer and interfaced with by generated code via TVMBAWs (for non-statically-sized allocates) or offsets (for static-sized allocates).
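   The offset-assignment step described above can be sketched as a toy linear allocator. Real USMP does liveness-aware packing and can overlap buffers with disjoint lifetimes; the names and alignment here are illustrative:

```python
def assign_pool_offsets(allocs, alignment=16):
    """Toy USMP-style placement: give each static-sized allocation a fixed
    offset inside a single pinned pool buffer.

    `allocs` maps buffer name -> size in bytes; returns (offsets, pool_size).
    This sketch simply lays buffers out end to end with alignment padding;
    a real planner would also overlap buffers with disjoint lifetimes.
    """
    offsets, cursor = {}, 0
    for name, size in allocs.items():
        cursor = -(-cursor // alignment) * alignment  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets, cursor
```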





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-1019923894


   P2 : #9950 
   Thanks @SebastianBoblestETAS 





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920463412


   @mbs-octoml To give a bit of context
   
   In the context of the CPU, we want to preserve small allocas until the code generation point, where the code will generate the stack alloca explicitly. Only when the memory is big enough (bigger than a constant threshold) do we use an opaque allocation instead.
   
   Stack allocation is important for the performance of CPU code. In TVM we do not have an explicit concept of registers in most cases. Instead we rely on LLVM's mem2reg pass to take stack allocations with constant indexing and turn them into registers, so the code can run effectively. So removing this code path can complicate the code-generator-side optimization quite a bit and slow down CPU code.
   
   Of course this can be a target-specific thing. LowerTVMBuiltin right now assumes it only runs on host (CPU) code.
   
   - Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation (as long as the allocation is fulfilled)
   - There are, however, cases when stack allocation is not possible 
       - When the size of memory requested is too big, stack alloca will explode the stack space (that is why there is a size check in the CPU case, and the use of an opaque global allocation was meant as a fallback to avoid stack overflow in models with big intermediate temp space)
       - LowerTVMBuiltin was originally designed to run on the host side, which means as soon as the allocation concerns device-side memory, it needs to call into a (host-side) device API to allocate the memory instead
   
   
   So the rationale for the specific CPU-side logic is:
   - We want to have stack alloca on the host when possible (to gain the mem2reg optimization)
   - When the requested size is too large, we fall back to opaque workspace allocation on the heap so the code can safely handle big temporary memory requests as well as dynamically sized allocation requests.
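   The two bullets amount to the following per-Allocate decision. The 1024-byte constant is kMaxStackAlloca from include/tvm/runtime/device_api.h; the function itself is a sketch of the pass's choice, not its actual code:

```python
KMAX_STACK_ALLOCA = 1024  # bytes, from include/tvm/runtime/device_api.h

def choose_allocation(constant_size_bytes):
    """Model of LowerTVMBuiltin's host-side choice for one Allocate node.

    `constant_size_bytes` is the size if known at compile time, else None
    for a dynamically sized request.
    """
    if constant_size_bytes is not None and constant_size_bytes <= KMAX_STACK_ALLOCA:
        return "stack"           # small constant extent: keep alloca for mem2reg
    return "heap-workspace"      # large or dynamic extent: opaque TVMBAW fallback
```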
   
   My guess is we need to look into why the VM cannot work with code that allocates on the stack in the multi-target case.





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921782373


   @tqchen ,
   
   > To say it in another way, we cannot say that "global" definitely means no stack allocation.
   
   The current issue is that device "CPU" && scope 'global' definitely means stack allocation if the size is less than the heuristic threshold, not the other way around.
   
    > If the code needs to impose additional constraint that the memory must be accessible from a separate device(e.g. NPU), it certainly would require a more specialized constraint that is better spelled out explicitly.
   
   > As we can see that this is another kind of flexibility we want to enable here -- flexibility of picking possible backend allocation implementations without over constraining the code generator to a backend specific behavior that is platform dependent (like the case of pinned memory
   
   Yes this is something we want eventually and we will be working towards achieving with USMP work.
   
   Until we have that, the natural assumption, in the absence of a 'constraint', should be that the memories are more accessible rather than less accessible (e.g. the stack). It's unfortunate that the current design prefers the latter, especially in the absence of a constraint.
   
   @mbs-octoml ,
   
   ### Short term solution :
   I think you are right, we might want to unblock this using a target-dependent kMaxStackAllocaSize. 
   
   May I ask why the default was chosen to be this?
   https://github.com/apache/tvm/blob/1fd8f610953adc39cbd18d82f4a9e92a11575dfc/include/tvm/runtime/device_api.h#L60-L61
   
   It's interesting because the total stack usage can go beyond that size, since the heuristic only looks at a single allocate at a time; i.e., you could have multiple allocates that are each less than 1024 bytes. So the stack usage is not actually bounded by the current approach.
   
   Therefore, to both unblock us with Ethos-U and also somewhat solve the problem that current micro builds use the stack for tensors < 1024 bytes instead of the workspace buffer provided, maybe we should just make kMaxStackAllocaSize=0 (a binary decision rather than a value range).
   
   @Mousius @leandron @areusch , this means another argument for a simple micro deployment is going to be added to the already long list of arguments. Something like "--use-external-workspace"? 
   
   @tqchen , I still feel it would be super helpful if kMaxStackAllocaSize were zero by default, with the option of going higher via a user argument, e.g. --max-stack-alloca-size=1024. It is not very convincing that we default to stack allocation of tensors, on the prospect of their being optimized by mem2reg, without any marking (i.e. a special storage scope), in a somewhat global fashion.
   
   
   





[GitHub] [tvm] masahi edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921412813


   > Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?
   
   @mbs-octoml The PR https://github.com/apache/tvm/pull/8366 I did might answer your question. The storage scope attached to `ProducerRealizeNode` is used to create the `PointerType` and `Buffer`. `Allocate` nodes query the storage scope associated with their buffer's `PointerType`. 
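   A minimal model of the flow described here: the scope lives on the buffer variable's pointer type, and the allocate node derives it from there. These dataclasses merely stand in for the TIR nodes; this is not the real tvm.tir API:

```python
from dataclasses import dataclass

@dataclass
class PointerType:
    storage_scope: str

@dataclass
class Var:
    name: str
    type_annotation: PointerType

@dataclass
class Allocate:
    buffer_var: Var

    @property
    def storage_scope(self):
        # Allocate carries no scope of its own; it queries the PointerType
        # of its buffer variable, as in the PR discussed above.
        return self.buffer_var.type_annotation.storage_scope
```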





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920580279


   Our proposal is to add a check in that loop for the 'local' storage_scope before we place allocations on the stack, as that is the solution that works for the wider definition of CPU, rather than performing a hidden optimization in the codegen that is applicable only to a subset of CPUs.





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920577170


   @tqchen @mbs-octoml ,
   
   This is not specific to the Arm(R) Ethos(TM)-U codegen; it is generally applicable to any microcontroller where we want to avoid allocating memory on the stack and instead service allocations via the platform abstraction, handled via TVMBackendAllocWorkspace --> TVMPlatformAllocate.
   
   This only showed up in the Arm(R) Ethos(TM)-U codegen because we use TVMPlatformAllocate to allocate memory from a buffer placed in a memory that is accessible by both the CPU and the NPU; that makes this a functional bug.
   Moreover, current main produces code that has much higher stack allocation for micros -- which is not desired.
   
   cc : @u99127  @areusch 
   
   > Stack allocation is important for the performance of the CPU code. In the case of TVM, we do not have explicit concept of registers in most cases. Instead we need to rely on LLVM's mem2reg pass to transform a set of constant indexing into stack allocation and turn them into registers, so the code can run effectively. So removing this code path can complicate the code generator side optimization by quite a bit and slow down the CPU code.
   
   The correct way to represent this seems to be that only tir.allocates with storage_scope="local" for device=CPU go onto the stack. For targets that need this behavior, there should be an explicit pass that converts allocates to 'local' so they are placed on the stack, rather than assuming this as the default behaviour.
   
   > Of course this can be a target specific thing. LowerTVMBuiltin right now has the assumption to only run on host(CPU) code.
   
   > - Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation (as long as the allocation is fulfilled)
   > - There are, however, cases when stack allocation is not possible
   >     - When the size of memory requested is too big, stack alloca will explode the stack space (that is why there is a size check in the CPU case, and the use of global opaque was meant as a fallback to avoid stack overflow in models with big intermediate temp space)
   >     - LowerTVMBuiltin was originally designed to run on the host side, which means as soon as the allocation is about device-side memory, it will need to call onto a (host-side) device API to allocate the memory instead
   
   Clearly, this definition of CPU leaves out micros.
   It feels wrong to print allocates with "global" storage_scope directly into a CPU PrimFunc as stack allocations; rather, they should be serviced via a TVMBAW call, which moves the responsibility to the runtime/application layer.
   
   > So rationales for the specific CPU side logic:
   > - We want to have stack alloca on host when possible (to gain the mem2reg optimization)
   > - When the requested size is too large, we fall back to opaque workspace allocation on the heap to allow the code to safely handle big temp memory requests as well as dynamic-size allocation requests.
   
   This certainly sounds like we could use an optimization pass to convert the tir.allocate's storage_scope for targets that require this, rather than making that the default behaviour for tir.allocates with the "global" storage scope.
   
   cc : @tom-gall @mbaret @Mousius 
   
   
   
   
   





[GitHub] [tvm] areusch commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
areusch commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923290583


   apologies for the delay in reply here.
   
   I agree that allocating on the stack is, in the current GraphExecutor world, a codegen-level optimization. In my mind, the introduction of AOTExecutor (and actually also VMExecutor) means that this is moving one level up. Here is why:
   - previously, the maximum effective call depth of a TIR function was 1, meaning that codegen could decide *then and there* that it made sense to move a particular local `tir.allocate` node onto the stack.
   - with the introduction of AOT and control flow VMExecutor, such `tir.allocate` may now live while another TIR function is being invoked e.g. with `tir.call_packed`.
   
   Now that this is the case, it doesn't make sense to make stack-allocation of `tir.allocate` a codegen-level optimization. It must be done prior to codegen using graph-level memory analysis e.g. USMP.
   
   The question then is how to model this in the program. I think using `storage_scope` does make sense. The definition I've been working with of a `storage_scope` is:
   > A contiguous region of memory, potentially with a maximum size, which can be used by operator implementations on a set of `tvm.Device`.
   
   Under this definition, a special "<context>.stack" `storage_scope` could be created with a target-specific bound on the amount of memory available. I do not think we should create a default `global` scope anywhere, as it's unlikely a global scope is truly ever global except in the case of single-CPU homogeneous model execution. We just haven't modeled this yet in TVM, as we don't have an explicit mapping indicating which storage_scopes are accessible by which device. The closest thing to a global scope I can think of would be something like an `executor` scope, which could be used to place the `DLTensor`s used by the AOT executor for intermediate tensors.
   
   Something we still have yet to address is how to configure the `storage_scope` made available for a particular Target. I think the Target-specific attributes may not quite cut it, as `llvm` obviously would have different amounts of stack RAM available based on the selected CPU. And, the user will need to be able to override this as well, since sometimes linking with different software libraries reduces the stack RAM available (e.g. some libraries require more global memory and therefore take away from that RAM available for stack usage; or other times, some libraries require massive interrupt stacks and a larger budget must be reserved to mitigate the risk of overwriting global memory due to an interrupt).
   
   It's likely that a scalable solution in micro-land is to have the Project API server able to provide a boilerplate memory layout in an e.g. `server_info_query` call; and then it's likely we will need to have a way for users to override this.
   
   Lastly, as to the `<context>` prefix mentioned above: there are many cases when multiple stacks exist on a device:
   - multi-core execution
   - parallel thread execution (e.g. stack viewed as a sort of thread-local storage)
   
   in the former case, it's likely that `<context>` should be replaced with the `name` given to the TargetDevice e.g. in https://github.com/apache/tvm/pull/8892. For example `dsp.stack` or `always-on-cpu.stack`.
   
   in the latter case, we probably additionally need a thread identifier e.g. `dsp.thread0.stack`.
   
   Until we have USMP, my thoughts are that the short-term solution should be to stick with a codegen-level optimization and add an attribute which can be used to disable the stack-allocation optimization. What do you think @tqchen @manupa-arm ?





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923310594


   One thing I want to note here is that the prefixes (global/shared/warp/local) have to do with the thread execution hierarchy, so we cannot add arbitrary prefixes here. We can, however, add any tag that follows.
   
   In most cases the storage is assumed to be local to the execution context (of the thread), so we do not need to annotate threading as part of the storage scope.





[GitHub] [tvm] areusch commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
areusch commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-924491151


   Okay @manupa-arm, this short-term solution seems workable to unblock the Ethos-U merge. I agree we should revisit how to solve this more generally in AOT once we have USMP and TargetDevice merged. We can discuss this in a separate post.








[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920486224


   Right, this then gets into the target-dependent generation regime, where a TargetKind attribute is indeed the right solution. We should also send a PR to add comments to that code block so we have more context in the future.





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921220669


   Thanks @manupa-arm . I understand that proposal R4 can also work by having a pass convert "global" to something more specialized (essentially R1 and R4 are not that different except for different choices of scope names).
   
   The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is. 
   
   Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This would include TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope.
   
   We can of course further constrain the setting, to say "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple buffers.
   
   So from a semantics point of view, the pass can choose to return "global" or rewrite to "global.stack", but if the scope remains "global", we should not preclude downstream passes from allocating on the stack (the code generator should be able to choose any kind that satisfies the constraint). 
   
   If the code needs to impose an additional constraint that the memory must be accessible from a separate device (e.g. an NPU), it certainly requires a more specialized constraint that is better spelled out explicitly. 
   
   As we can see, this is another kind of flexibility we want to enable here -- flexibility in picking possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform dependent (like the case of pinned memory)
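
The "scope as constraint" reading might be sketched like this. This is a hypothetical Python model; only "global" is an existing scope here, and the tagged names follow the proposals in this thread.

```python
# Each scope names a set of admissible allocation strategies. A more
# specialized tag shrinks the set, but a scope never forces one single
# choice unless its set has exactly one element. Tagged names are
# hypothetical, taken from this discussion.

ADMISSIBLE = {
    "global":           {"stack", "TVMBAW", "platform"},  # any host-CPU-accessible memory
    "global.stack":     {"stack"},                        # must be a stack allocation
    "global.workspace": {"TVMBAW"},                       # must use the workspace API
}

def codegen_may_use(scope, strategy):
    """May the code generator service `scope` with `strategy`?"""
    return strategy in ADMISSIBLE[scope]

# Plain "global" does not preclude a stack allocation downstream...
print(codegen_may_use("global", "stack"))            # True
# ...while the tagged scopes rule strategies out rather than in.
print(codegen_may_use("global.workspace", "stack"))  # False
```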





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921220669


   Thanks @manupa-arm . I understand that proposal R4 can also work by having a pass convert "global" to something more specialized (essentially R1 and R4 are not that different except for different choices of scope names).
   
   The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is. 
   
   Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This means the actual implementation can come from TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope.
   
   We can of course further constrain the setting, to say "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple workspace buffers.
   
   So from a semantics point of view, the pass can indeed choose to return "global", or rewrite to "global.stack" to ensure it is a stack allocation. But if the scope remains "global", we should not preclude downstream passes from allocating on the stack (the code generator should be able to choose any kind that satisfies the constraint). 
   
   If the code needs to impose an additional constraint that the memory must be accessible from a separate device (e.g. an NPU), it certainly requires a more specialized constraint that is better spelled out explicitly. 
   
   As we can see, this is another kind of flexibility we want to enable here -- flexibility in picking possible backend allocation implementations without over-constraining the code generator to a backend-specific behavior that is platform dependent (like the case of pinned memory)





[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921934672


   Thanks @tqchen for the explanation of the two viewpoints from which we could see the constraints.
   
   I do agree that we should add constraints progressively to guide the compiler in the lowering. 
   
   It is just that the treatment of TVMBAW as a peer to heap and stack seemed not right. In fact, one could use TVMBAW to perform heap allocation. My concern was that we are taking a decision way down in the compilation flow that could have been taken transparently in the IR itself, a bit higher up.
   
   I think we are moving there with scopes, but it would have been nicer to stick to TVMBAW for now, as it is the more general lowering for 'global', and I would not necessarily see it as an orthogonal choice to the list you have mentioned. It just boils down to the fact that we want the allocations on the stack for mem2reg optimizations. While I don't disagree with the logic of the argument, I wish it were more explicit higher up in the compilation flow. If it were not for mem2reg optimizations, one could simply provide an Arena that supplies the workspace just as the stack would -- thus that seemed to me like a better interim solution until we specialize the scope in the lowering.
   
   Yes, as you suggested, the implementation of the target-dependent query for the max alloca size is not particularly challenging; it is the API we expose to the user that we were worried about. This is important especially since "micro" is not really a target in TVM -- so the short-term solution seems to be that we would need one of the following TargetKind attributes for the C and LLVM backends: 
   
   A1 :  --max-stack-alloca-size
   A2 :  --max-stack-size
   A3 :  --use-backend-alloc-workspace
   
   So here, ideally from a UX point of view, it would be better to give the user the A2 argument, since constraining the size of a single allocation seems like a proxy to control the behaviour of mem2reg transformations. 
   
   A3, on the other hand, sets kMaxStackAllocaSize to zero, forcing all allocates to be serviced by TVMBAW, which could be from a heap or an Arena placed anywhere, controlled by the runtime/application.
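
How such a per-target attribute would steer the lowering can be sketched as below. The attribute semantics follow A1/A3 above; these are proposals from this thread, not existing TVM TargetKind attributes, and the sketch does not use TVM itself.

```python
# A1: a per-target cap on the size of a single stack allocation.
# A3: collapsing that cap to zero, so every allocate is serviced by
# TVMBackendAllocWorkspace (heap, Arena, etc., chosen by the runtime).
# Both attributes are hypothetical, taken from this discussion.

def lower_allocate(size_bytes, max_stack_alloca_size):
    if size_bytes <= max_stack_alloca_size:
        return "stack_alloca"
    return "TVMBackendAllocWorkspace"

print(lower_allocate(256, max_stack_alloca_size=1024))  # A1 permits stack alloca
print(lower_allocate(256, max_stack_alloca_size=0))     # A3 forces TVMBAW
```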





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920896447


   So in the above post I tried to summarize the state. Now let me try to share
   some of my thoughts based on the summary.
   
   First of all, R0 and R1 are not that different in nature. Both try to introduce
   two separate scopes that bring different behaviors. The main question boils down
   to how we can name the "global" scope.
   
   Per Allocate semantics, we treat "global" as normal CPU memory, which can come
   from the stack or a platform-specific allocation. The system can choose the best way
   of doing such lowering. However, memory that is accessible from an NPU is something
   more specialized and could use a special memory tag for differentiation
   purposes.
   
   While it is OK to differentiate stack-allocated memory from a platform-specific one,
   doing so would bring an additional burden to the user and would require a significant
   refactor of the operator implementations.
   
   Note that we will likely need a related behavior for micro devices as well,
   in the case of N0. The main requests so far come from the need of N1. In that case,
   it would be easy for the AOT generator to allocate memory with a special tag ("global.workspace")
   that enforces workspace allocation, since in this setting there is a single expected behavior.
   
   So my suggestion would be R1+R2, as it helps to resolve the need in a way that is compatible
   with the current semantics and use cases. It will also open doors for more future scope-dependent optimizations





[GitHub] [tvm] mbs-octoml edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
mbs-octoml edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920485102


   Thanks so much for the context. I'll try to capture that in a comment.
   This is a 'bug' only in the sense that this heuristic is not working for the EthosU AOT codegen, I think because they are expecting to intercept the workpool ops downstream? But it does suggest a very simple way forward: make the kMaxAlloca size a TargetKind attribute so they can force it to zero.
   @manupa-arm can you chime in here? 





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920463412


   @mbs-octoml I believe the current behavior is intended. 
   
   In the context of CPU, we want to preserve small allocas until the code generation point, and then the code generator will emit the stack alloca in an explicit way. Only when memory is big enough (bigger than a constant) do we use an opaque allocation instead.
   
   
   Stack allocation is important for the performance of the CPU code, because we need to rely on LLVM's mem2reg pass to transform a set of constant indexings into a stack allocation and turn them into registers, so the code can run effectively. Of course this can be a target-specific thing. LowerTVMBuiltin right now has the assumption of only running on host (CPU) code.
   





[GitHub] [tvm] tqchen edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920463412


   @mbs-octoml To give a bit of context:
   
   In the context of CPU, we want to preserve small allocas until the code generation point, and then the code generator will emit the stack alloca in an explicit way. Only when memory is big enough (bigger than a constant) do we use an opaque allocation instead.
   
   Stack allocation is important for the performance of the CPU code. In the case of TVM, we do not have an explicit concept of registers in most cases. Instead we need to rely on LLVM's mem2reg pass to transform a set of constant indexings into a stack allocation and turn them into registers, so the code can run effectively. So removing this code path would complicate the code-generator-side optimization by quite a bit and slow down the CPU code.
   
   Of course this can be a target-specific thing. LowerTVMBuiltin right now has the assumption of only running on host (CPU) code.
   
   - Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation (as long as the allocation is fulfilled)
   - There are, however, cases where stack allocation is not possible:
       - When the size of the requested memory is too big, stack alloca will explode the stack space (that is why there is a size check in the CPU case; the use of global opaque allocation was meant as a fallback to avoid stack overflow in models with big intermediate temp space)
       - LowerTVMBuiltin was originally designed to run on the host side, which means as soon as the allocation is about device-side memory, it needs to call onto a (host-side) device API to allocate the memory instead
   
   So in this case the special-case logic for CPU was intentional, and my guess is we need to look into why the VM cannot work with code that allocates on the stack in the multiple-target case





[GitHub] [tvm] masahi edited a comment on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921412813


   > Can someone explain how the current Stage scope, which ends up in ProducerRealize nodes, would be coherent with the Allocate scope?
   
   @mbs-octoml The PR https://github.com/apache/tvm/pull/8366 I did might answer your question. The storage scope attached to `ProducerRealizeNode` is used to create the `PointerType` and `Buffer`. The `Allocate` node queries the storage scope associated with its buffer variable's `PointerType`. 
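
A rough Python model of that flow, with stand-in classes loosely mirroring the TIR node names (greatly simplified; the real nodes carry much more state):

```python
# The scope attached at realize time is baked into the buffer variable's
# pointer type; Allocate does not store a scope of its own but recovers
# it by querying the PointerType of its buffer variable.

class PointerType:
    def __init__(self, storage_scope):
        self.storage_scope = storage_scope

class Var:
    def __init__(self, name, type_annotation):
        self.name = name
        self.type_annotation = type_annotation

class Allocate:
    def __init__(self, buffer_var):
        self.buffer_var = buffer_var

    @property
    def storage_scope(self):
        # Queried from the buffer variable's PointerType, not stored here.
        return self.buffer_var.type_annotation.storage_scope

buf = Var("workspace", PointerType("global"))
print(Allocate(buf).storage_scope)  # global
```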





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921220669


   Thanks @manupa-arm . I understand that proposal R4 can also work by having a pass convert "global" to something more specialized (essentially R1 and R4 are not that different except for different choices of scope names).
   
   The main question is the semantics of the scope "global". Each memory scope represents a "constraint" on what kind of memory it is. 
   
   Right now, when the device type is CPU, "global" means any memory that can be accessed by the host CPU. This would include TVMBAW, memory from the stack, or memory allocated by other means. While memory allocated by TVMBAW can have other benefits (e.g. accessible by other devices because it is pinned), that is not the constraint specified by the "global" scope.
   
   We can of course further constrain the setting, to say "global.workspace", which reduces the possible ways to allocate the memory but still does not preclude choosing between multiple buffers.
   
   So from a semantics point of view, the pass can choose to return "global" or rewrite to "global.stack", but if the scope is "global", we should not preclude the possibility of downstream allocating from the stack.
   
   So as we can see, this is another kind of flexibility we want to enable here -- flexibility in picking possible backend allocation implementations. 





[GitHub] [tvm] mbs-octoml commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
mbs-octoml commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920485102


   Thanks so much for the context. I'll try to capture that in a comment.
   This is a 'bug' only in the sense that this heuristic is not working for the EthosU AOT codegen, I think because they are expecting to intercept the workpool ops downstream? But it does suggest a very simple way forward: make the kMaxAlloca size a TargetKind attribute so they can force it to zero.
   @manupa-arm can you chime in here? 





[GitHub] [tvm] tqchen commented on issue #9022: [Bug] BuiltinLower does not use alloca for storage on kDLCPU target devices

Posted by GitBox <gi...@apache.org>.
tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592


   Thanks for the discussions. Before we suggest a resolution, it would be helpful
   to summarize the discussions so far. 
   
   # Semantics of Allocate and storage_scope
   
   Allocate at the moment does not specify the way to allocate the memory.
   The lowering phase is free to pick a runtime-dependent one or a 
   more platform-native one (e.g. stack alloca or direct declaration of shared memory).
   
   In practice, we see two different kinds of needs:
   - N0: The need for optimizing "device"-level code.
   - N1: The need to be able to allocate memory from the host side, possibly
     in a runtime-dependent way.
   
   N0 is needed to get the best-performing kernel, since a native way of allocation 
   that can be understood by the compiler gives the best chance for follow-up
   optimizations. This is the case for CPU-related optimizations. Note that such
   optimization is also needed in the micro setting, when we use TIR to generate kernel
   code that requests temp scratch memory for output tiling.
   
   N1 is needed for "driving" the set of computations. In the case of GPU, we have
   to rely on a host-side driver function. In the case of micro AOT, it is desirable
   to allocate using platform-specific memory.
   
   One possible approach is to ask the user to differentiate these two kinds of
   allocations. However, to give the user maximum convenience, TVM allows a mixed host/device
   programming model like CUDA and hides that complexity from the user. The boundary
   between N0 and N1 can also be blurred, e.g. a kernel can request to allocate from
   platform-dependent memory.
   
   On specific specialized target devices, we also see an additional need:
   
   - N2: Allocating memory with special semantics. 
   
   For example, a unified device pinned memory that is accessible from both NPU and CPU, 
   a specialized texture memory, or shared memory. The request in N2 is quite different 
   and brings additional requirements on how we allocate the memory and how the memory can be 
   used and represented in codegen.
   
   Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified".
   This usually lets TVM know that the memory needs to be lowered by possibly
   special target-dependent passes, and leaves it to them.
   
   ## Possible Resolutions
   
   Based on the discussion so far, there are two suggested resolutions.
   
   - R0: Separate out a "local" scope that carries the stack allocation behavior.
   - R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace"
     that represents a global memory request specifically for workspace memory,
     and introduce lowerings for them. For specialized memory (e.g. NPU unified), introduce
     separate memory scopes. 
   - R3: Introduce a target-specific attribute that marks the possible stack alloca size for 
     lowering the R0 ("local") or R1 ("global"). Note R3 can be done in addition to R0 or R1.
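
The R1 tagging step could look roughly like this. This is a hypothetical sketch: "global.workspace" is the tag proposed in this thread, and the pass shape is illustrative rather than an actual TVM pass.

```python
# Sketch of R1: leave "global" as-is by default, but let the AOT
# generator (or a later pass) retag plain "global" allocations as
# "global.workspace" when workspace allocation must be enforced (N1).
# Already-specialized tags (N2, e.g. "global.npu_unified") are left
# untouched for target-specific passes to lower.

def tag_workspace_scopes(allocations, force_workspace):
    tagged = []
    for scope, size in allocations:
        if force_workspace and scope == "global":
            scope = "global.workspace"
        tagged.append((scope, size))
    return tagged

allocs = [("global", 64), ("global.npu_unified", 1024)]
print(tag_workspace_scopes(allocs, force_workspace=True))
# [('global.workspace', 64), ('global.npu_unified', 1024)]
```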
   
    

