Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/21 14:27:34 UTC

[GitHub] [airflow] yuqian90 opened a new pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

yuqian90 opened a new pull request #15989:
URL: https://github.com/apache/airflow/pull/15989


   Fixes #15938
   
   `multiprocessing.Pool` is known to sometimes become stuck, which causes celery_executor to hang randomly. This happens at least on Debian and Ubuntu using Python 3.8.7 and Python 3.8.10. The issue is reproducible by running `test_send_tasks_to_celery_hang` in this PR several times (with the db backend set to something other than sqlite, because sqlite disables some parallelization).
   
   The issue goes away once we switch to `concurrent.futures.ProcessPoolExecutor`. In Python 3.6 and earlier, `ProcessPoolExecutor` has no `initializer` argument. Fortunately, it isn't needed: `reset_signals` is no longer necessary because the signal handler now checks whether the current process is the parent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#issuecomment-849712560


   Oh yes, will take a look.





[GitHub] [airflow] yuqian90 merged pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

yuqian90 merged pull request #15989:
URL: https://github.com/apache/airflow/pull/15989


   





[GitHub] [airflow] ashb commented on pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

ashb commented on pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#issuecomment-849723516


   Could you just address the few small niggles? (Mostly to minimize the change so it is easier to backport and easier to read.)





[GitHub] [airflow] ashb commented on a change in pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

ashb commented on a change in pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#discussion_r640727665



##########
File path: tests/executors/test_celery_executor.py
##########
@@ -484,3 +485,57 @@ def test_should_support_base_backend(self):
         assert [
             'DEBUG:airflow.executors.celery_executor.BulkStateFetcher:Fetched 2 state(s) for 2 task(s)'
         ] == cm.output
+
+
+class MockTask:
+    """
+    A picklable object used to mock tasks sent to Celery. Can't use the mock library
+    here because it's not picklable.
+    """
+
+    def apply_async(self, *args, **kwargs):
+        return 1
+
+
+def _exit_gracefully(signum, _):
+    print(f"{os.getpid()} Exiting gracefully upon receiving signal {signum}")
+    sys.exit(signum)
+
+
+@pytest.fixture
+def register_signals():
+    """
+    Register the same signals as scheduler does to test celery_executor to make sure it does not
+    hang.
+    """
+
+    print(f"{os.getpid()} register_signals()")
+

Review comment:
       ```suggestion
   ```

##########
File path: airflow/jobs/scheduler_job.py
##########
@@ -745,21 +753,24 @@ def register_signals(self) -> None:
 
     def _exit_gracefully(self, signum, frame) -> None:  # pylint: disable=unused-argument
         """Helper method to clean up processor_agent to avoid leaving orphan processes."""
-        self.log.info("Exiting gracefully upon receiving signal %s", signum)
-        if self.processor_agent:
-            self.processor_agent.end()
-        sys.exit(os.EX_OK)
+        if _is_parent_process():
+            # Only the parent process should perform the cleanup.
+            self.log.info("Exiting gracefully upon receiving signal %s", signum)
+            if self.processor_agent:
+                self.processor_agent.end()
+            sys.exit(os.EX_OK)
 
     def _debug_dump(self, signum, frame):  # pylint: disable=unused-argument
-        try:
-            sig_name = signal.Signals(signum).name  # pylint: disable=no-member
-        except Exception:  # pylint: disable=broad-except
-            sig_name = str(signum)
+        if _is_parent_process():

Review comment:
       ```suggestion
           if not _is_parent_process():
               return
   ```
   
   Then we don't need to indent/change the rest of this fn.







[GitHub] [airflow] yuqian90 commented on pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

yuqian90 commented on pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#issuecomment-850249234


   > Could you just address the few small niggles (mostly to minimize the change so it is easier to backport, and easier to read.)
   
   Thanks. These have been addressed.





[GitHub] [airflow] davidcaron commented on pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

davidcaron commented on pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#issuecomment-849681949


   We also had the same intermittent issues on our production servers. I wanted to help test this, so I deployed this PR and the problem hasn't happened since.
   
   Previously, the scheduler would hang once or twice per day on average. It's been running fine for 48 hours now. I will update this comment if something happens, but you can assume that no news is good news.
   
   Thank you for figuring this out!





[GitHub] [airflow] github-actions[bot] commented on pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

github-actions[bot] commented on pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#issuecomment-849736601


   The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.





[GitHub] [airflow] potiuk commented on a change in pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

potiuk commented on a change in pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#discussion_r640702837



##########
File path: scripts/ci/docker-compose/base.yml
##########
@@ -34,6 +34,8 @@ services:
     ports:
       - "${WEBSERVER_HOST_PORT}:8080"
       - "${FLOWER_HOST_PORT}:5555"
+    cap_add:

Review comment:
       Looks good. No problem with the capability added :)







[GitHub] [airflow] yuqian90 commented on a change in pull request #15989: Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing

yuqian90 commented on a change in pull request #15989:
URL: https://github.com/apache/airflow/pull/15989#discussion_r636976979



##########
File path: scripts/ci/docker-compose/base.yml
##########
@@ -34,6 +34,8 @@ services:
     ports:
       - "${WEBSERVER_HOST_PORT}:8080"
       - "${FLOWER_HOST_PORT}:5555"
+    cap_add:

Review comment:
       `SYS_PTRACE` isn't necessary to fix the issue. I added this because it makes it possible to run `py-spy dump` in breeze. @potiuk let me know if you have concerns. I can remove this change.



