You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/07/11 20:31:49 UTC

[GitHub] [pulsar] zbentley commented on issue #16527: Due to missing Python GIL management, Python clients constructed with a non-default Logger fail to clean up their threads and often segfault when Python garbage collects them

zbentley commented on issue #16527:
URL: https://github.com/apache/pulsar/issues/16527#issuecomment-1180840687

   My theory as to why this is happening:
   
   The problem is the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock). When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount-bump are few and not terribly shared. However, decrementing a refcount without the GIL is extremely dangerous; the act of decrementing a refcount can trigger object destruction, which can then trigger more object destruction, and so on: decrementing a refcount triggers an arbitrary number of user functions (destructors) to run in Python, and can trigger wide-ranging changes (including system calls, memory allocation/deallocation--basically anything) across the interpreter's internal state.
   
   Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, preventing thread reaping from running internally. In some cases, the racy python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" (i.e. random objects changing type) errors are raised in Python itself, though you may have to run the above snippet for a long time to see one of those.
   
   Such GIL-unprotected refcount manipulation occurs in the Pulsar client [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L106), [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L114), [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L172), and [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L176), though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL.
   
   To see if this theory's correct, I'll put up a PR in a few hours which GIL-protects those segments and see if that makes the issues go away.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org