Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/04/06 15:52:02 UTC

[GitHub] [tvm-rfcs] areusch commented on pull request #62: Collage RFC

areusch commented on PR #62:
URL: https://github.com/apache/tvm-rfcs/pull/62#issuecomment-1090430092

   We discussed this RFC at the TVM Community meeting. Here are some notes beyond the [presentation content](https://docs.google.com/presentation/d/1sRz8hVy_619VmbSRwJPGmcMfgBzaqMOAVaVNxlSgGFM/edit#slide=id.p):
   - @mbs-octoml notes that this RFC isn't particularly set in stone and it may grow/change as this effort proceeds. It's a bit speculative right now.
   - note: "on TVM" means running regular Relay operators on TVM
   - in the autotuning case you'd do autotuning on each path from beginning to end?
     - currently we are doing autotuning on the fly. For AutoTVM this isn't too bad. For the new MetaSchedule, every candidate kernel is treated as its own tuning task, so there are many more candidate kernels to explore (TVM's present FuseOps pass is greedy and always combines kernels, but MetaSchedule does not necessarily do so).
   - will we still use the cache mechanism to avoid re-tuning identical operators? 
     - yes, and we hope to get a good hit rate from the cache.
   - by cache do you mean something you've created yourself? Or is it an online service or a tuning log on GitHub?
     - @mbs-octoml: there's an abstract cost-estimator interface (given an IRModule, return a double). In the prototype there's only one instantiation of this interface, which runs using the TVM runners and the standard benchmarking machinery in Python. Inside OctoML we'll have different instantiations of this interface which consult our internal infrastructure. The caching in the prototype works via a naïve in-memory cache, coupled with a bit of a hack to populate it with standard AutoTVM tuning records (see the first sketch after these notes).
   - @manupa-arm: after the candidate partitioning is searched, will Collage consider merging adjacent offloaded subgraphs to the same compiler?
     - yes, there is a cleanup pass that will merge adjacent subgraphs; such adjacent same-compiler subgraphs can arise from the heuristics used to choose partitions. This is a little like how MergeCompilerRegions works now (see the second sketch after these notes). Generally speaking we expect this to be orthogonal to the results of the search, because we expect transition latency to be small; if we do see a difference, it means Collage was not trying to offload the right size of subgraphs.
   - if we identify
     - when I say partitioning, what do I mean? Can you explore serial vs. parallel execution? What about notions of inlining (e.g. a reshape shared 38 times)? Should I be exploring doing the reshape once and sharing the result? None of this is being done right now.
   - the changes needed are just to expand the search model and the way we compute end-to-end measurement latency?
     - the constraint we need to stay within here is that the transform you want to try is local: the changes can be confined to a subgraph, and you can time that subgraph on its own. Beyond that, you get outside the bounds of dynamic programming (see the third sketch after these notes).
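
   A minimal sketch of the cost-estimator interface and naïve in-memory cache described above, in Python. The class names are hypothetical, and `tvm.ir.structural_hash` as the cache key is an illustrative choice, not necessarily what the prototype uses; populating the cache from AutoTVM tuning records is elided.

   ```python
   from abc import ABC, abstractmethod

   import tvm


   class CostEstimator(ABC):
       """Abstract interface: given an IRModule holding one candidate
       kernel, return its estimated latency in seconds."""

       @abstractmethod
       def estimate_seconds(self, mod: tvm.IRModule) -> float:
           ...


   class BenchmarkingCostEstimator(CostEstimator):
       """Stand-in for the prototype's single instantiation, which runs
       the TVM runners and the standard benchmarking machinery."""

       def estimate_seconds(self, mod: tvm.IRModule) -> float:
           raise NotImplementedError("compile `mod` and benchmark it here")


   class CachingCostEstimator(CostEstimator):
       """Naive in-memory cache around another estimator, so identical
       candidate kernels are measured at most once."""

       def __init__(self, inner: CostEstimator):
           self.inner = inner
           self.cache = {}

       def estimate_seconds(self, mod: tvm.IRModule) -> float:
           key = tvm.ir.structural_hash(mod)  # identical kernels hash alike
           if key not in self.cache:
               self.cache[key] = self.inner.estimate_seconds(mod)
           return self.cache[key]
   ```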
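
   A minimal sketch of the cleanup pass that merges adjacent subgraphs offloaded to the same compiler. The `Partition` record and `merge_adjacent` helper are hypothetical stand-ins; the real pass works on Relay partitions, in the spirit of the existing MergeCompilerRegions pass.

   ```python
   from dataclasses import dataclass, field
   from typing import List


   @dataclass
   class Partition:
       compiler: str  # e.g. "tensorrt", "cutlass", or "tvm"
       ops: List[str] = field(default_factory=list)  # placeholder for the partition's operators


   def merge_adjacent(partitions: List[Partition]) -> List[Partition]:
       """Merge each run of adjacent partitions that target the same compiler."""
       merged: List[Partition] = []
       for part in partitions:
           if merged and merged[-1].compiler == part.compiler:
               merged[-1].ops.extend(part.ops)  # fold into the previous run
           else:
               merged.append(Partition(part.compiler, list(part.ops)))
       return merged
   ```

   For example, `merge_adjacent([Partition("trt", ["conv"]), Partition("trt", ["relu"]), Partition("tvm", ["add"])])` yields two partitions instead of three.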
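
   Finally, a minimal sketch of why locality matters for the search. Assuming, purely for illustration, a linear chain of operators (the real search runs over dataflow graphs), dynamic programming finds the cheapest partitioning exactly because `candidate_cost(i, j, compiler)` depends only on the subgraph spanning operators [i, j):

   ```python
   from typing import Callable, List, Tuple

   # (start, end, compiler): the operators [start, end) form one partition.
   Choice = Tuple[int, int, str]


   def best_partitioning(
       num_ops: int,
       compilers: List[str],
       candidate_cost: Callable[[int, int, str], float],
       transition_cost: float = 0.0,  # assumed-small cost of crossing partitions
   ) -> Tuple[float, List[Choice]]:
       # best[i] = cheapest (cost, partitions) covering the prefix [0, i).
       best: List[Tuple[float, List[Choice]]] = [(0.0, [])]
       for j in range(1, num_ops + 1):
           options = []
           for i in range(j):
               for compiler in compilers:
                   prev_cost, prev_parts = best[i]
                   cost = prev_cost + candidate_cost(i, j, compiler)
                   if prev_parts:
                       cost += transition_cost
                   options.append((cost, prev_parts + [(i, j, compiler)]))
           best.append(min(options, key=lambda option: option[0]))
       return best[num_ops]
   ```

   If a candidate's measured latency depended on how the rest of the model was partitioned, `candidate_cost` would no longer be well defined per subgraph and the recurrence above would break; that is the locality constraint referred to in the notes.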


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org