Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/21 14:35:15 UTC

[GitHub] [airflow] yuqian90 commented on issue #15938: celery_executor becomes stuck if child process receives signal before reset_signals is called

yuqian90 commented on issue #15938:
URL: https://github.com/apache/airflow/issues/15938#issuecomment-845994258


   > Just how slow does it have to be to happen?
   > We can probably guard this by closing of the current pid when we register them, and checking that the signal is received by the same pid
   
   Hi @ashb, it's not clear to me exactly how slow it must be for this to happen. It looks like as long as some child processes are a fraction of a second slower than the others, they easily get into a deadlock when a SIGTERM is received. So even transient slowness on a beefy machine can trigger it.
   
   Here's what I tried so far. Only the last method seems to fix the issue completely (i.e. we have to stop using `multiprocessing.Pool`):
   - Tried to reset the signal handler to `signal.SIG_DFL` in `register_signals` if the current process is a child process. This doesn't help because the child process inherits the parent's signal handler when it's forked. Still hangs occasionally.
   - Tried to make `_exit_gracefully` a no-op if the current process is a child process. This isn't sufficient. Still hangs occasionally.
   - Tried to change multiprocessing to use "spawn" instead of "fork", as some people suggested [on the internet](https://pythonspeed.com/articles/python-multiprocessing/). This greatly reduced the chance of the issue happening, but after running the reproducing example about 8000 times it still hung once. So it doesn't fix the issue completely.
   - **Replaced `multiprocessing.Pool` with `concurrent.futures.process.ProcessPoolExecutor`. Once this is done, the reproducing example no longer hangs, even after running it tens of thousands of times.** So I put up PR #15938, which fixes the issue using this method.
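   
   To illustrate the swap in the last bullet, here is a minimal standalone sketch (the task function and its arguments are made up for illustration, not the actual Airflow code):
   
   ```python
   from concurrent.futures import ProcessPoolExecutor
   
   
   def send_task(args):
       # Stand-in for the work each child process performs; illustrative only.
       key, command = args
       return key, len(command)
   
   
   tasks = [("task1", "run a"), ("task2", "run bb")]
   
   # Before: results = multiprocessing.Pool(2).map(send_task, tasks)
   # After: ProcessPoolExecutor exposes a near-identical map() interface,
   # so the call site barely changes.
   with ProcessPoolExecutor(max_workers=2) as pool:
       results = list(pool.map(send_task, tasks))
   ```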
   
   From experience, `multiprocessing.Pool` is notorious for causing mysterious hangs like these. `ProcessPoolExecutor` does not suffer from the same problems, even though it has a similar interface and uses similar underlying primitives. I don't understand exactly why the switch fixes the issue, but in practice it always seems to help.
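   
   The first bullet's observation, that a forked child starts life with the parent's signal handler already installed, can be demonstrated directly. In this sketch, `_exit_gracefully` is just an illustrative handler name borrowed from the discussion above, not the real Airflow function:
   
   ```python
   import os
   import signal
   
   
   def _exit_gracefully(signum, frame):
       # Illustrative handler; the real one does cleanup before exiting.
       pass
   
   
   # Parent installs the handler, then forks.
   signal.signal(signal.SIGTERM, _exit_gracefully)
   
   pid = os.fork()
   if pid == 0:
       # The child inherits the parent's SIGTERM handler across fork(),
       # so resetting it to SIG_DFL only takes effect if done *after*
       # the fork, inside the child.
       inherited = signal.getsignal(signal.SIGTERM)
       os._exit(0 if inherited is _exit_gracefully else 1)
   else:
       _, status = os.waitpid(pid, 0)
       child_saw_handler = os.WEXITSTATUS(status) == 0
   ```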


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org