Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 00:50:43 UTC

[GitHub] [airflow] amoGLingle opened a new issue, #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

amoGLingle opened a new issue, #24909:
URL: https://github.com/apache/airflow/issues/24909

   ### Apache Airflow version
   
   2.1.0
   
   ### What happened
   
   We have been running Airflow 2.1.0 with Scheduler HA for about 8 months, having upgraded from 1.8.  Recently (in the last 3-4 months) we've encountered situations where the schedulers lock up with no tasks running.
   
   Symptom:
   No tasks getting run.  Nothing running at all.  Restarted workers, no luck.
   
   Looked at scheduler logs on 2 schedulers (syslogs) and saw numerous entries like:
   {code}
   [root@af2-dod-prod-master1 centos]# cat /var/log/messages | grep "list index"
   Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
   Mar 29 03:10:05 af2-dod-prod-master1 scl: list index out of range
   --
   Mar 29 03:10:23 af2-dod-prod-master1 scl: [2022-03-29 03:10:23,672] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
   Mar 29 03:10:23 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_dsp_audience_edge', task_id='emit_datamine_druid_delay_to_influxdb', execution_date=datetime.datetime(2022, 3, 28, 20, 0, tzinfo=Timezone('UTC')), try_number=1)
   --
   Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:429} ERROR - Marking run <DagRun dod_queue_execution_monitor_worker4 @ 2022-03-29 03:05:00+00:00: scheduled__2022-03-29T03:05:00+00:00, externally triggered: False> failed
   Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:608} WARNING - Failed to record first_task_scheduling_delay metric:
   Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
   --
   Mar 29 03:10:01 af2-dod-prod-master1 scl: [2022-03-29 03:10:01,631] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
   Mar 29 03:10:01 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_sync_monitor', task_id='load_dod_sync_post_data', execution_date=datetime.datetime(2022, 3, 29, 3, 5, tzinfo=Timezone('UTC')), try_number=1)
   {code}
   This seems to be a bug in Airflow or Celery: the documentation at http://sqlalche.me/e/13/7s2a says that this happens when an application improperly ignores a transaction exception and doesn't roll back. Further explanation is at https://docs.sqlalchemy.org/en/13/faq/sessions.html#faq-session-rollback
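
For anyone triaging similar logs, the sketch below tallies these rollback errors per DAG. It assumes the syslog layout shown above, where each error is followed by a `Celery Task ID:` line and the scheduler's ANSI colour codes may appear as literal `#033[...m` text. The function names are illustrative and are not part of Airflow.

```python
import re
from collections import Counter

# Assumption: syslog has recorded ANSI colour codes as literal "#033[...m" text.
ANSI_DEBRIS = re.compile(r"#033\[[0-9;]*m")
ROLLBACK_MARKER = "This Session's transaction has been rolled back"
DAG_ID = re.compile(r"dag_id='([^']+)'")


def strip_colour_codes(line):
    """Remove literal '#033[...m' colour-code text from one log line."""
    return ANSI_DEBRIS.sub("", line)


def rollback_errors_by_dag(lines):
    """Count rollback errors per dag_id.

    Each error line is paired with the next 'Celery Task ID:' line.
    Errors with no such line are counted under '<unknown>'.
    """
    counts = Counter()
    pending = False  # True while an error is waiting for its task-ID line
    for raw in lines:
        line = strip_colour_codes(raw)
        if ROLLBACK_MARKER in line:
            if pending:
                counts["<unknown>"] += 1
            pending = True
        elif pending and "Celery Task ID:" in line:
            match = DAG_ID.search(line)
            counts[match.group(1) if match else "<unknown>"] += 1
            pending = False
    if pending:
        counts["<unknown>"] += 1
    return counts
```

Calling `rollback_errors_by_dag(open("/var/log/messages"))` would give a per-DAG count, which may help show whether the errors cluster around particular DAGs or hit everything at once.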
   
   A prior AIRFLOW jira shows this has been seen before: https://issues.apache.org/jira/browse/AIRFLOW-6202?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22This%20Session%27s%20transaction%20has%20been%20rolled%20back%20due%20to%20a%20previous%20exception%20during%20flush.%22
   
   We have encountered this issue 3 times in the past ~4 months: twice on the PROD cluster and once on the QA one.
   
   
   ### What you think should happen instead
   
   Schedulers should not hang due to a locked transaction.  Tasks should keep executing.
   As my description above says, pointing to the relevant SQLAlchemy documentation, there seems to be a point in the code where the transaction isn't rolled back when it should be.
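
The expected behaviour can be illustrated in miniature. The sketch below uses the standard-library `sqlite3` module as a stand-in, an assumption made so the example is self-contained; it is not Airflow's or SQLAlchemy's actual code, and the `transaction` helper and `task` table are hypothetical. It shows a failed transaction being rolled back before the connection is reused.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Commit on success; roll back and re-raise on any exception."""
    try:
        yield conn
        conn.commit()
    except Exception:
        # Without this rollback the earlier INSERT would stay pending and
        # be committed together with whatever the next caller writes.
        conn.rollback()
        raise


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY)")
    conn.commit()

    try:
        with transaction(conn):
            conn.execute("INSERT INTO task (id) VALUES (1)")
            # Duplicate primary key: raises sqlite3.IntegrityError.
            conn.execute("INSERT INTO task (id) VALUES (1)")
    except sqlite3.IntegrityError:
        pass  # the helper has already rolled the transaction back

    # The connection is usable again for a fresh transaction.
    with transaction(conn):
        conn.execute("INSERT INTO task (id) VALUES (2)")

    rows = [row[0] for row in conn.execute("SELECT id FROM task ORDER BY id")]
    conn.close()
    return rows
```

Here `demo()` returns only the row written by the second transaction, because the first was rolled back as a whole. A SQLAlchemy `Session` is stricter than `sqlite3`: after a failed flush it refuses further work until `Session.rollback()` is called, which is the error shown in the logs above.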
   
   ### How to reproduce
   
   I have no idea how to reproduce.  This happens during the normal course of running DAGs.
   
   
   ### Operating System
   
   Centos Linux 7
   
   ### Versions of Apache Airflow Providers
   
   prod-master1 centos]# pip list
   apache-airflow                           2.1.0
   apache-airflow-providers-apache-druid    2.0.0
   apache-airflow-providers-apache-livy     2.0.0
   apache-airflow-providers-cncf-kubernetes 2.0.0
   apache-airflow-providers-ftp             1.1.0
   apache-airflow-providers-http            2.0.0
   apache-airflow-providers-imap            1.0.1
   apache-airflow-providers-mysql           2.0.0
   apache-airflow-providers-postgres        2.0.0
   apache-airflow-providers-snowflake       2.1.0
   apache-airflow-providers-sqlite          1.0.2
   
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   Manual hand deploy following instructions on Airflow website.
   
   ### Anything else
   
   This seems to occur only once every few months.  When it does, our production DAGs just lock up.  We have a monitoring DAG for each of our queues.  Each runs a single small task that pushes to InfluxDB/Grafana, and Grafana alerts PagerDuty when such lockups occur (or on other issues, such as networking outages or task runners being down).
   
   The description above shows logs with the ERROR and a pointer to where the issue might be: possibly a transaction not being rolled back after an exception.
   
   Hope this can be (or has already been) found and fixed.
   
   Thank You.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1355469985

   > Question:
   > Does airflow have/support an official module/dag that does db cleanup?
   
   Look for the `airflow db clean` command (added in 2.3, I think).
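
A minimal sketch of how that command might be driven from a script, assuming Airflow 2.3 or later is installed and on `PATH`. `build_db_clean_command` and `retention_days` are hypothetical names; `--clean-before-timestamp` and `--dry-run` are options of `airflow db clean`.

```python
from datetime import datetime, timedelta, timezone


def build_db_clean_command(retention_days, now=None, dry_run=True):
    """Build the argv list for `airflow db clean`, keeping `retention_days` of history.

    `now` can be injected for testing; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    cmd = [
        "airflow", "db", "clean",
        "--clean-before-timestamp", cutoff.isoformat(sep=" ", timespec="seconds"),
    ]
    if dry_run:
        # Preview only; nothing is deleted.
        cmd.append("--dry-run")
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it. Starting with the dry run, and on a non-production database, seems the safer order given how the earlier cleanup DAG behaved.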
   
   




[GitHub] [airflow] potiuk commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1245110172

   Thanks. That might help with pin-pointing it.




[GitHub] [airflow] boring-cyborg[bot] commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1178414883

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] amoGLingle commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
amoGLingle commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1192115740

   As an experiment, we're turning off one of the Schedulers - no HA - to see if we still get a deadlock.




[GitHub] [airflow] amoGLingle commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
amoGLingle commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1355356024

   Hello,
   I think we found the culprit and can close this.
   We had been occasionally running the db cleanup dag that is part of
   https://github.com/teamclairvoyant/airflow-maintenance-dags
   There didn't seem to be a correlation, but the last time it was run, the system locked up within an hour.
   I did notice that there's an updated version that we weren't running, but we haven't bothered to install it:
   the risk of running it is too high.
   
   Question:
   Does airflow have/support an official module/dag that does db cleanup?
   




[GitHub] [airflow] amoGLingle commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
amoGLingle commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1244858209

   ah sry for the delay!
   DB is RDS mysql 8.0.23
   executor = CeleryExecutor
   celery module vers 4.4.2
   
   Full module list, just in case:
   {code}
   centos]# pip list
   Package                                  Version
   ---------------------------------------- ---------
   alembic                                  1.6.2
   amqp                                     2.6.1
   anyio                                    3.2.1
   apache-airflow                           2.1.0
   apache-airflow-providers-apache-druid    2.0.0
   apache-airflow-providers-apache-livy     2.0.0
   apache-airflow-providers-cncf-kubernetes 2.0.0
   apache-airflow-providers-ftp             1.1.0
   apache-airflow-providers-http            2.0.0
   apache-airflow-providers-imap            1.0.1
   apache-airflow-providers-mysql           2.0.0
   apache-airflow-providers-postgres        2.0.0
   apache-airflow-providers-snowflake       2.1.0
   apache-airflow-providers-sqlite          1.0.2
   apispec                                  3.3.2
   argcomplete                              1.12.3
   asn1crypto                               1.4.0
   async-generator                          1.10
   attrs                                    20.3.0
   Authlib                                  0.15.5
   azure-common                             1.1.27
   azure-core                               1.17.0
   azure-storage-blob                       12.8.1
   Babel                                    2.9.1
   bcrypt                                   3.2.0
   billiard                                 3.6.4.0
   blinker                                  1.4
   boto3                                    1.17.102
   botocore                                 1.20.102
   cached-property                          1.5.2
   cachetools                               4.2.2
   cattrs                                   1.0.0
   celery                                   4.4.2
   certifi                                  2020.12.5
   cffi                                     1.14.5
   chardet                                  4.0.0
   click                                    7.1.2
   clickclick                               20.10.2
   colorama                                 0.4.4
   colorlog                                 5.0.1
   commonmark                               0.9.1
   contextvars                              2.4
   croniter                                 1.0.13
   cryptography                             3.4.7
   dataclasses                              0.7
   defusedxml                               0.7.1
   dill                                     0.3.1.1
   dnspython                                1.16.0
   docutils                                 0.17.1
   email-validator                          1.1.2
   fab-oidc                                 0.0.9
   Flask                                    1.1.2
   Flask-Admin                              1.5.8
   Flask-AppBuilder                         3.3.0
   Flask-Babel                              1.0.0
   Flask-Bcrypt                             0.7.1
   Flask-Caching                            1.10.1
   Flask-JWT-Extended                       3.25.1
   Flask-Login                              0.4.1
   Flask-Mail                               0.9.1
   flask-oidc                               1.4.0
   Flask-OpenID                             1.2.5
   Flask-SQLAlchemy                         2.5.1
   Flask-WTF                                0.14.3
   google-auth                              1.32.0
   graphviz                                 0.16
   gunicorn                                 20.1.0
   h11                                      0.12.0
   httpcore                                 0.13.6
   httplib2                                 0.20.2
   httpx                                    0.18.2
   idna                                     2.10
   immutables                               0.15
   importlib-metadata                       1.7.0
   importlib-resources                      1.5.0
   inflection                               0.5.1
   influxdb                                 5.3.1
   iso8601                                  0.1.14
   isodate                                  0.6.0
   itsdangerous                             1.1.0
   Jinja2                                   2.11.3
   jmespath                                 0.10.0
   jsonschema                               3.2.0
   kombu                                    4.6.11
   kubernetes                               11.0.0
   lazy-object-proxy                        1.4.3
   ldap3                                    2.9
   lockfile                                 0.12.2
   Mako                                     1.1.4
   Markdown                                 3.3.4
   MarkupSafe                               1.1.1
   marshmallow                              3.12.1
   marshmallow-enum                         1.5.1
   marshmallow-oneofschema                  2.1.0
   marshmallow-sqlalchemy                   0.23.1
   msgpack                                  1.0.2
   msrest                                   0.6.21
   mysql-connector-python                   8.0.22
   mysqlclient                              2.0.3
   numpy                                    1.19.5
   oauth2client                             4.1.3
   oauthlib                                 3.1.1
   openapi-schema-validator                 0.1.5
   openapi-spec-validator                   0.3.0
   oscrypto                                 1.2.1
   pandas                                   1.1.5
   pendulum                                 2.1.2
   pep562                                   1.0
   pip                                      21.1.2
   polling2                                 0.4.7
   prison                                   0.1.3
   protobuf                                 3.17.3
   psutil                                   5.8.0
   psycopg2-binary                          2.9.1
   pyasn1                                   0.4.8
   pyasn1-modules                           0.2.8
   pycparser                                2.20
   pycryptodomex                            3.10.1
   pydruid                                  0.6.2
   Pygments                                 2.9.0
   PyJWT                                    1.7.1
   pyOpenSSL                                20.0.1
   pyparsing                                3.0.6
   pyrsistent                               0.17.3
   python-daemon                            2.3.0
   python-dateutil                          2.8.1
   python-editor                            1.0.4
   python-ldap                              3.3.1
   python-nvd3                              0.15.0
   python-slugify                           4.0.1
   python3-openid                           3.2.0
   pytz                                     2021.1
   pytzdata                                 2020.1
   PyYAML                                   5.4.1
   requests                                 2.25.1
   requests-oauthlib                        1.3.0
   rfc3986                                  1.5.0
   rich                                     9.2.0
   rsa                                      4.7.2
   s3transfer                               0.4.2
   semantic-version                         2.8.5
   setproctitle                             1.2.2
   setuptools                               57.0.0
   setuptools-rust                          0.12.1
   six                                      1.16.0
   sniffio                                  1.2.0
   snowflake-connector-python               2.5.1
   snowflake-sqlalchemy                     1.2.5
   SQLAlchemy                               1.3.24
   SQLAlchemy-JSONField                     1.0.0
   SQLAlchemy-Utils                         0.37.2
   swagger-ui-bundle                        0.0.8
   tabulate                                 0.8.9
   tenacity                                 6.2.0
   termcolor                                1.1.0
   text-unidecode                           1.3
   toml                                     0.10.2
   typing                                   3.7.4.3
   typing-extensions                        3.7.4.3
   unicodecsv                               0.14.1
   urllib3                                  1.26.6
   vine                                     1.3.0
   virtualenv                               15.1.0
   websocket-client                         1.1.0
   Werkzeug                                 1.0.1
   wheel                                    0.36.2
   WTForms                                  2.3.3
   zipp                                     3.4.1
   {code}
   




[GitHub] [airflow] amoGLingle commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
amoGLingle commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1355743923

   thx




[GitHub] [airflow] amoGLingle commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
amoGLingle commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1244862047

   Also, an update.
   We've been running with a single scheduler since the last hang and haven't seen the issue since then.
   Not saying that HA is the issue, just that we haven't seen the issue.
   Thx,
   G




[GitHub] [airflow] potiuk commented on issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24909:
URL: https://github.com/apache/airflow/issues/24909#issuecomment-1194067604

   Question: Which database, and which version of it, do you have, @amoGLingle?




[GitHub] [airflow] potiuk closed issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #24909: Airflow Scheduler Deadlock - Transaction not rolled back on Exception?
URL: https://github.com/apache/airflow/issues/24909

