You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tvm.apache.org by "yelite (via GitHub)" <gi...@apache.org> on 2023/10/26 22:25:50 UTC

[PR] [Disco] Explicitly set the session on DPackedFunc and DModule [tvm]

yelite opened a new pull request, #15996:
URL: https://github.com/apache/tvm/pull/15996

   This PR changes how to get the `Session` from `DPackedFunc` or `DModule`. Without this, there will be data corruption in the message channel, if Python gc happens in a different thread than the thread that owns the Disco session, which is typical in a multi-thread environment like LLM inference server.
   
   An explanation on how such data corruption could occur:
   
   1. Every time dref.session gets called. A new Session Python object will be created, due to how FFI works https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/python/tvm/runtime/disco/session.py#L42
   2. When we want to call the model through disco, we need to first obtain the DPackedFunc through DModule . This process calls the _get_cached_method on a new Session object (follows (1)) https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/python/tvm/runtime/disco/session.py#L97
   3. In hasattr(self, "_cache") , a call is made to the tvm::ReflectionVTable::GetAttr to check if it exists in the TVM object https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/python/tvm/runtime/disco/session.py#L107
   4. This call will fail and raise error, which gets caught by https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/python/tvm/_ffi/_ctypes/packed_func.py#L239. hasattr will swallow this error and return False.
   5. During the creation of the FFI error in (4), traceback objects are created to help trace into C++ function. https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/python/tvm/_ffi/base.py#L373 Holding a reference to the frame object creates circular references that goes 'frame -> parent frames -> frame that holds the traceback object -> traceback -> frame'. All local variables in the call stack will be indirectly referenced and cannot be freed by ref counting. This is okay because Python gc will collect these objects once they are unreachable.
   6. (5) prevents the DPackedFunc returned from _get_cached_method to be freed through Python ref counting, because it's in the frame that's indirectly referred by the circular reference in (5).
   7. If Python gc happens in the main thread, the DPackedFunc from (6) will be collected, because there is no references to the newly-created Session, other than the reference path rooted from the circular reference between traceback and frame (from (5)).
   8. (7) will result in writing to the pipe in the main thread from the DRef destructor. https://github.com/apache/tvm/blob/ebbe38f3281776cfda4fce0b188892ab1c5c7572/include/tvm/runtime/disco/session.h#L301 If other disco activity happens in another thread, there will be multiple threads writing to the same pipe without synchronization. Data will be corrupted.
   
   This PR fixes (1) and (3), so that the data corruption will not occur anymore.
   
   cc @junrushao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [Disco] Explicitly set the session on DPackedFunc and DModule [tvm]

Posted by "junrushao (via GitHub)" <gi...@apache.org>.
junrushao merged PR #15996:
URL: https://github.com/apache/tvm/pull/15996


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org