Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/17 17:15:21 UTC

[GitHub] [airflow] seanmuth opened a new issue, #28431: Celery Stalled Running Tasks (SRT)

seanmuth opened a new issue, #28431:
URL: https://github.com/apache/airflow/issues/28431

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   Reproduced in versions 2.4.3 and 2.5.0.
   
   Tasks can get stuck in the running state indefinitely while doing nothing.
   
   
   
   ### What you think should happen instead
   
   It appears that a `psycopg2.connect()` call is hung, according to the stack trace below. It is unclear why the connection does not time out.
   
   ```
   astro@extragalactic-wavelength-1420-worker-default-85587d6dc8-74pnn:/usr/local/airflow$ ~/.local/bin/py-spy dump -p 4343
   Process 4343: airflow task runner: dachshund_dogfood dog_13 scheduled__2022-12-17T07:51:00+00:00 350017
   Python v3.9.15 (/usr/local/bin/python3.9)
   
   Thread 4343 (idle): "MainThread"
       connect (psycopg2/__init__.py:122)
       connect (sqlalchemy/engine/default.py:598)
       connect (sqlalchemy/engine/create.py:578)
       __connect (sqlalchemy/pool/base.py:680)
       __init__ (sqlalchemy/pool/base.py:386)
       _create_connection (sqlalchemy/pool/base.py:271)
       _do_get (sqlalchemy/pool/impl.py:256)
       checkout (sqlalchemy/pool/base.py:491)
       _checkout (sqlalchemy/pool/base.py:888)
       connect (sqlalchemy/pool/base.py:325)
       _wrap_pool_connect (sqlalchemy/engine/base.py:3361)
       raw_connection (sqlalchemy/engine/base.py:3394)
       __init__ (sqlalchemy/engine/base.py:96)
       connect (sqlalchemy/engine/base.py:3315)
       _connection_for_bind (sqlalchemy/orm/session.py:747)
       _connection_for_bind (sqlalchemy/orm/session.py:735)
       connection (sqlalchemy/orm/session.py:626)
       _bulk_insert (sqlalchemy/orm/persistence.py:74)
       _bulk_save_mappings (sqlalchemy/orm/session.py:3896)
       bulk_insert_mappings (sqlalchemy/orm/session.py:3805)
       default_action_log (airflow/utils/cli_action_loggers.py:109)
       on_pre_execution (airflow/utils/cli_action_loggers.py:68)
       wrapper (airflow/utils/cli.py:94)
       command (airflow/cli/cli_parser.py:52)
       _start_by_fork (airflow/task/task_runner/standard_task_runner.py:95)
       start (airflow/task/task_runner/standard_task_runner.py:43)
       _execute (airflow/jobs/local_task_job.py:113)
       run (airflow/jobs/base_job.py:247)
       _run_task_by_local_task_job (airflow/cli/commands/task_command.py:252)
       _run_task_by_selected_method (airflow/cli/commands/task_command.py:193)
       task_run (airflow/cli/commands/task_command.py:396)
       wrapper (airflow/utils/cli.py:108)
       command (airflow/cli/cli_parser.py:52)
       _execute_in_fork (airflow/executors/celery_executor.py:130)
       execute_command (airflow/executors/celery_executor.py:96)
       __protected_call__ (celery/app/trace.py:734)
       trace_task (celery/app/trace.py:451)
       fast_trace_task (celery/app/trace.py:649)
       workloop (billiard/pool.py:362)
       __call__ (billiard/pool.py:292)
       run (billiard/process.py:114)
       _bootstrap (billiard/process.py:327)
       _launch (billiard/popen_fork.py:79)
       __init__ (billiard/popen_fork.py:24)
       _Popen (billiard/context.py:333)
       start (billiard/process.py:124)
       _create_worker_process (billiard/pool.py:1158)
       _create_worker_process (celery/concurrency/asynpool.py:480)
       __init__ (billiard/pool.py:1046)
       __init__ (celery/concurrency/asynpool.py:463)
       on_start (celery/concurrency/prefork.py:109)
       start (celery/concurrency/base.py:129)
       start (celery/bootsteps.py:365)
       start (celery/bootsteps.py:116)
       start (celery/worker/worker.py:203)
       worker (celery/bin/worker.py:351)
       caller (celery/bin/base.py:134)
       new_func (click/decorators.py:26)
       invoke (click/core.py:760)
       invoke (click/core.py:1404)
       invoke (click/core.py:1657)
       main (click/core.py:1055)
       start (celery/app/base.py:371)
       worker_main (celery/app/base.py:391)
       worker (airflow/cli/commands/celery_command.py:200)
       wrapper (airflow/utils/cli.py:108)
       command (airflow/cli/cli_parser.py:52)
       main (airflow/__main__.py:39)
       <module> (airflow:8)
   ```
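
On the question of why the connection never times out: libpq only enforces a connection deadline when `connect_timeout` is set, and SQLAlchemy's psycopg2 dialect forwards URL query parameters to `psycopg2.connect()` as keyword arguments. The sketch below is a minimal, stdlib-only helper that appends that parameter to a connection URL. The function name and the example URL are hypothetical, and this is an assumption about a possible mitigation, not something tested in this report. It bounds only the network phase of connecting and would not release a process that is blocked on something other than the network.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_connect_timeout(db_url: str, seconds: int = 10) -> str:
    """Return db_url with a connect_timeout query parameter set (hypothetical helper)."""
    parts = urlsplit(db_url)
    # Keep existing query parameters, overriding any previous connect_timeout.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["connect_timeout"] = str(seconds)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL; substitute the deployment's real sql_alchemy_conn value.
example = with_connect_timeout("postgresql+psycopg2://user:pw@db.example:5432/airflow")
print(example)
```

The resulting string could then be used as the `sql_alchemy_conn` value for the metadata database.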
   
   ### How to reproduce
   
   After seeing this behavior on an Astronomer customer deployment, I created a reproduction DAG:
   
   ```
   from datetime import datetime, timedelta
   
   from airflow import DAG
   from airflow.providers.postgres.operators.postgres import PostgresOperator
   
   
   def create_dag(dag_id, **kwargs):
       with DAG(
           dag_id,
           start_date=datetime(2022, 12, 12),
           schedule_interval='*/3 * * * *',
           catchup=False,
           max_active_runs=1
       ) as dag:
           tasks = []
           for n in range(25):
               def create_task(x):
                   return PostgresOperator(
                       task_id=f'dog_{x}',
                       postgres_conn_id='airflow_sql_alchemy_conn',
                       sql=f"SELECT 'dog_{x}';",
                       sla=timedelta(minutes=1),
                       dag=dag
                   )
               tasks.append(create_task(n))
           for ix, t in enumerate(tasks):
               if ix != 0:
                   t.set_upstream(tasks[ix-1])
   
   dog_breeds = [
       'labrador', 'cocker_spaniel', 'german_shepherd', 'boxer', 'terrier', 'beagle', 'poodle', 'dachshund', 'shih_tzu', 'dalmatian',
       'border_collie', 'pug', 'rottweiler', 'bulldog', 'frenchie', 'great_dane', 'dobermann', 'chihuahua', 'maltese', 'malamute'
   ]
   
   for dog in dog_breeds:
       new_dag = create_dag(f'{dog}_dogfood')
   ```
   
   The DAGs use the `airflow_sql_alchemy_conn` connection, which is created from the default `sql_alchemy_conn` pointing at the metadata db.
   
   Left to run overnight, this consistently causes an SRT:
   ![image](https://user-images.githubusercontent.com/12852047/208253307-f65ac466-43e9-468a-91bc-6f0f5ada10de.png)
   
   Local task logs:
   ```
   [2022-12-17, 12:22:27 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: dachshund_dogfood.dog_4 scheduled__2022-12-17T12:18:00+00:00 [queued]>
   [2022-12-17, 12:22:28 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: dachshund_dogfood.dog_4 scheduled__2022-12-17T12:18:00+00:00 [queued]>
   [2022-12-17, 12:22:28 UTC] {taskinstance.py:1362} INFO - 
   --------------------------------------------------------------------------------
   [2022-12-17, 12:22:28 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
   [2022-12-17, 12:22:28 UTC] {taskinstance.py:1364} INFO - 
   --------------------------------------------------------------------------------
   [2022-12-17, 12:22:28 UTC] {listener.py:27} INFO - TaskInstance Details: dag_id=dachshund_dogfood, task_id=dog_4, dagrun_id=scheduled__2022-12-17T12:18:00+00:00, map_index=-1, run_start_date=2022-12-17 12:22:27.947465+00:00, try_number=1, job_id=698048, op_classpath=airflow.providers.postgres.operators.postgres.PostgresOperator
   [2022-12-17, 12:22:28 UTC] {taskinstance.py:1383} INFO - Executing <Task(PostgresOperator): dog_4> on 2022-12-17 12:18:00+00:00
   [2022-12-17, 12:22:28 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'dachshund_dogfood', 'dog_4', 'scheduled__2022-12-17T12:18:00+00:00', '--job-id', '698048', '--raw', '--subdir', 'DAGS_FOLDER/dogfood_dag.py', '--cfg-path', '/tmp/tmpqab801s7']
   [2022-12-17, 12:22:28 UTC] {standard_task_runner.py:83} INFO - Job 698048: Subtask dog_4
   [2022-12-17, 12:22:28 UTC] {standard_task_runner.py:55} INFO - Started process 21377 to run task
   ```
   
   
   
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-amazon==6.2.0
   apache-airflow-providers-apache-hive==4.1.1
   apache-airflow-providers-apache-livy==3.2.0
   apache-airflow-providers-celery==3.1.0
   apache-airflow-providers-cncf-kubernetes==5.0.0
   apache-airflow-providers-common-sql==1.2.0
   apache-airflow-providers-databricks==4.0.0
   apache-airflow-providers-dbt-cloud==2.3.0
   apache-airflow-providers-elasticsearch==4.3.1
   apache-airflow-providers-ftp==3.2.0
   apache-airflow-providers-google==8.6.0
   apache-airflow-providers-http==4.1.0
   apache-airflow-providers-imap==3.1.0
   apache-airflow-providers-microsoft-azure==5.0.0
   apache-airflow-providers-postgres==5.2.2
   apache-airflow-providers-redis==3.1.0
   apache-airflow-providers-sftp==4.2.0
   apache-airflow-providers-snowflake==4.0.2
   apache-airflow-providers-sqlite==3.2.1
   apache-airflow-providers-ssh==3.3.0
   astronomer-providers==1.11.2
   ```
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Astronomer Runtime 6.0.5 / Airflow 2.4.3+astro.2 
   
   ### Anything else
   
   No pattern found to date. One SRT doesn't block other worker threads; other tasks are scheduled and executed as expected.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] mobuchowski commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "mobuchowski (via GitHub)" <gi...@apache.org>.
mobuchowski commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1420682600

   @kaxil I believe it fixes that. Why would this not be a part of 2.5.2 for example though?




[GitHub] [airflow] eladkal commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1425925701

   The PR template has the instructions... in your PR you removed the template :) 
   ![Screenshot 2023-02-10 at 16 56 55](https://user-images.githubusercontent.com/45845474/218122975-28aec5f0-9049-49ad-9cd1-f80767205091.png)
   
   Just open a new PR adding a file named `29289.significant.rst` to https://github.com/apache/airflow/tree/main/newsfragments and in it explain the essence of the change (what users need to be aware of). We use these news fragments to generate the release notes.
   




[GitHub] [airflow] seanmuth commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by GitBox <gi...@apache.org>.
seanmuth commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1356348904

   Adding `strace` output of a stuck airflow task runner:
   
   ```
   strace: Process 4343 attached
   futex(0x7f2f484f3e78, FUTEX_WAIT_PRIVATE, 2, NULL
   ```
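
A `FUTEX_WAIT_PRIVATE` call with a `NULL` timeout means the process is waiting on a lock with no deadline. One well-known way for a forked process to end up in that state is forking while another thread in the parent holds a lock: the child inherits the lock in its locked state but not the thread that would release it. This has not been confirmed as the cause in this thread. The sketch below reproduces that general mechanism with the standard library only; the function name is hypothetical, and a timeout is used so the demonstration ends instead of hanging.

```python
import os
import threading


def child_can_acquire_after_fork(hold_lock_in_parent: bool, timeout: float = 1.0) -> bool:
    """Fork and report whether the child could acquire a lock created in the parent."""
    lock = threading.Lock()
    acquired = threading.Event()
    release = threading.Event()

    def holder() -> None:
        # Hold the lock until the parent tells this thread to let go.
        with lock:
            acquired.set()
            release.wait()

    if hold_lock_in_parent:
        threading.Thread(target=holder, daemon=True).start()
        acquired.wait()  # make sure the lock is really held before forking

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so a lock that was held
        # by another thread at fork time is never released in this process.
        ok = lock.acquire(timeout=timeout)
        os._exit(0 if ok else 1)

    release.set()  # let the holder thread in the parent finish
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("no lock held at fork:", child_can_acquire_after_fork(False))
    print("lock held at fork:   ", child_can_acquire_after_fork(True))
```

Without the `timeout` argument, the child in the second case would block forever, which is what an untimed futex wait looks like from `strace`.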
   
   




[GitHub] [airflow] eladkal commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1418900558

   cc @kaxil WDYT?
   Seems like a compatibility issue between Airflow 2.5 and Open Lineage. Not sure if this is an Airflow issue, but since we are discussing [native integration with Open Lineage](https://lists.apache.org/thread/2brvl4ynkxcff86zlokkb47wb5gx8hw7) it is worth checking this issue.




[GitHub] [airflow] kaxil commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "kaxil (via GitHub)" <gi...@apache.org>.
kaxil commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1420004383

   cc @mobuchowski




[GitHub] [airflow] mobuchowski commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "mobuchowski (via GitHub)" <gi...@apache.org>.
mobuchowski commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1425889350

   @eladkal I'm not sure what a news fragment is, so if I need to do something, please direct me to some resource.




[GitHub] [airflow] seanmuth commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by GitBox <gi...@apache.org>.
seanmuth commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1357965618

   I should also add that when setting `OPENLINEAGE_DISABLED=true` I am unable to reproduce the issue. This indicates a strong correlation with OL being involved, but I haven't been able to find hard evidence of what OL is doing to "block" that final `psycopg2.connect()` call.
   
   Discussion on OL: https://github.com/OpenLineage/OpenLineage/discussions/1423




[GitHub] [airflow] kaxil commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "kaxil (via GitHub)" <gi...@apache.org>.
kaxil commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1420005256

   https://github.com/apache/airflow/pull/29289 should fix it; that would be part of 2.6, but @mobuchowski can confirm.




[GitHub] [airflow] eladkal commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1424906545

   So I'm closing this issue as fixed in main and planned to be released in 2.6.
   (If a news fragment is required, I guess it is better to add it now rather than coming back to it during cherry-picking @ephraimbuddy @mobuchowski)




[GitHub] [airflow] ephraimbuddy commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "ephraimbuddy (via GitHub)" <gi...@apache.org>.
ephraimbuddy commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1425909015

   @mobuchowski, see https://github.com/apache/airflow/blob/main/newsfragments/29395.significant.rst. It's useful to have one if there's a significant change or a breaking change you want mentioned in the release notes. E.g. https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst#significant-changes




[GitHub] [airflow] boring-cyborg[bot] commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1356337868

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] mobuchowski commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "mobuchowski (via GitHub)" <gi...@apache.org>.
mobuchowski commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1422448573

   I don't think `airflow/listeners/events.py` was meant to be part of the public API or something imported directly by users. The API is contained in `airflow/listeners/spec`.
   
   There is, however, one breaking change that I'm working on fixing in a separate PR: passing the proper previous TaskInstanceState to `on_task_instance_*` methods.
   
   I don't know the Airflow release process, so I defer judgement to you.




[GitHub] [airflow] ephraimbuddy commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "ephraimbuddy (via GitHub)" <gi...@apache.org>.
ephraimbuddy commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1422404226

   > @kaxil I believe it fixes that. Why would this not be a part of 2.5.2 for example though?
   
   Looks like we need to add a newsfragment entry for this fix since it removed `airflow/listeners/events.py`.
   
   IMO, it shouldn't be part of 2.5.2 since it's more like an improvement and also has a breaking change in it: users can no longer import anything from `airflow/listeners/events.py`.




[GitHub] [airflow] mobuchowski commented on issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "mobuchowski (via GitHub)" <gi...@apache.org>.
mobuchowski commented on issue #28431:
URL: https://github.com/apache/airflow/issues/28431#issuecomment-1429496327

   FYI @eladkal @ephraimbuddy https://github.com/apache/airflow/pull/29528




[GitHub] [airflow] eladkal closed issue #28431: Celery Stalled Running Tasks (SRT)

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal closed issue #28431: Celery Stalled Running Tasks (SRT)
URL: https://github.com/apache/airflow/issues/28431

