You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/09/08 13:51:50 UTC

[GitHub] [airflow] NickYadance opened a new issue, #26237: Deadlock caused by airflow statistics query

NickYadance opened a new issue, #26237:
URL: https://github.com/apache/airflow/issues/26237

   ### Apache Airflow version
   
   Other Airflow 2 version
   
   ### What happened
   
   Scheduler got crashed soon after reboot.
   ```
   [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s]
   [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)]
   (Background on this error at: http://sqlalche.me/e/14/e3q8)
   [2022-09-08 07:50:45,270] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor
   [2022-09-08 07:50:45,292] {manager.py:341} WARNING - Ending without manager process.
   [2022-09-08 07:50:45,292] {scheduler_job.py:768} INFO - Exited execute loop
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context
       cursor, statement, parameters, context
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
       cursor.execute(statement, parameters)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
       res = self._query(query)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
       db.query(q)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
       _mysql.connection.query(self, query)
   MySQLdb._exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/share/miniconda2/envs/airflow/bin/airflow", line 8, in <module>
       sys.exit(main())
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
       args.func(args)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
       return func(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
       return f(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
       _run_scheduler_job(args=args)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
       job.run()
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run
       self._execute()
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 730, in _execute
       self.executor.start()
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 82, in start
       self.kubernetes_executor.start()
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 539, in start
       self.clear_not_launched_queued_tasks()
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper
       return func(*args, session=session, **kwargs)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 515, in clear_not_launched_queued_tasks
       ).update({TaskInstance.state: State.SCHEDULED})
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3196, in update
       execution_options={"synchronize_session": synchronize_session},
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1670, in execute
       result = conn._execute_20(statement, params or {}, execution_options)
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1520, in _execute_20
       return meth(self, args_10style, kwargs_10style, execution_options)
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 314, in _execute_on_connection
       self, multiparams, params, execution_options
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1399, in _execute_clauseelement
       cache_hit=cache_hit,
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1749, in _execute_context
       e, statement, parameters, cursor, context
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1930, in _handle_dbapi_exception
       sqlalchemy_exception, with_traceback=exc_info[2], from_=e
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
       raise exception
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context
       cursor, statement, parameters, context
     File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
       cursor.execute(statement, parameters)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
       res = self._query(query)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
       db.query(q)
     File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
       _mysql.connection.query(self, query)
   sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
   [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s]
   [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)]
   (Background on this error at: http://sqlalche.me/e/14/e3q8)
   ```
   
   Database engine status shows that some select query is holding locks of the table `task_instance`, causing all the updates to that table hang and timeout. 
   
   ```
   LATEST DETECTED DEADLOCK
   ------------------------
   2022-09-08 06:31:35 139737395513088
   *** (1) TRANSACTION:
   TRANSACTION 4413895189, ACTIVE 67 sec starting index read
   mysql tables in use 2, locked 2
   LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s)
   MySQL thread id 1330145004, OS thread handle 139724943472384, query id 66415927240 10.244.2.0 airflow statistics
   SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.map_index AS task_instance_map_index, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_in
   
   *** (1) HOLDS THE LOCK(S):
   RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413895189 lock_mode X locks rec but not gap
   Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
    0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
    1: len 8; hex 747261696e5f4d59; asc xx;;
    2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
    3: len 4; hex 7fffffff; asc     ;;
    4: len 6; hex 000107166d96; asc     m ;;
    5: len 7; hex 01000000dc188c; asc        ;;
    6: len 7; hex 631989a00430e4; asc c    0 ;;
    7: SQL NULL;
    8: SQL NULL;
    9: len 7; hex 72756e6e696e67; asc running;;
    10: len 4; hex 80000002; asc     ;;
    11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes);
    12: len 4; hex 726f6f74; asc root;;
    13: len 4; hex 804e4728; asc  NG(;;
    14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
    15: len 7; hex 64656661756c74; asc default;;
    16: len 4; hex 80000003; asc     ;;
    17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
    18: len 7; hex 6319892f08ff4f; asc c  /  O;;
    19: len 4; hex 80000060; asc    `;;
    20: len 4; hex 80000004; asc     ;;
    21: len 5; hex 80047d942e; asc   } .;;
    22: len 4; hex 80000001; asc     ;;
    23: len 4; hex 804e478a; asc  NG ;;
    24: SQL NULL;
    25: SQL NULL;
    26: SQL NULL;
    27: SQL NULL;
    28: SQL NULL;
   
   
   *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
   RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413895189 lock_mode X locks rec but not gap waiting
   Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
    0: len 4; hex 8005cdf8; asc     ;;
    1: len 6; hex 00010716a9ca; asc       ;;
    2: len 7; hex 02000002b72669; asc      &i;;
    3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
    4: len 7; hex 63162b10000000; asc c +    ;;
    5: len 6; hex 717565756564; asc queued;;
    6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
    7: len 1; hex 80; asc  ;;
    8: len 5; hex 80047d942e; asc   } .;;
    9: SQL NULL;
    10: SQL NULL;
    11: len 9; hex 7363686564756c6564; asc scheduled;;
    12: SQL NULL;
    13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
    14: len 4; hex 804ba58d; asc  K  ;;
    15: len 7; hex 63198bfe0195b2; asc c      ;;
    16: len 7; hex 63162b10000000; asc c +    ;;
    17: len 7; hex 63163920000000; asc c 9    ;;
    18: len 4; hex 80000001; asc     ;;
   
   
   *** (2) TRANSACTION:
   TRANSACTION 4413893066, ACTIVE 73 sec starting index read
   mysql tables in use 1, locked 1
   LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 2
   MySQL thread id 1329985529, OS thread handle 139729063040768, query id 66416312012 10.244.0.215 airflow updating
   UPDATE task_instance SET state='restarting' WHERE task_instance.task_id = 'xx' AND task_instance.dag_id = 'xx' AND task_instance.run_id = 'scheduled__2022-09-05T17:00:00+00:00' AND task_instance.map_index = -1
   
   *** (2) HOLDS THE LOCK(S):
   RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413893066 lock_mode X locks rec but not gap
   Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0
    0: len 4; hex 8005cdf8; asc     ;;
    1: len 6; hex 00010716a9ca; asc       ;;
    2: len 7; hex 02000002b72669; asc      &i;;
    3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
    4: len 7; hex 63162b10000000; asc c +    ;;
    5: len 6; hex 717565756564; asc queued;;
    6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
    7: len 1; hex 80; asc  ;;
    8: len 5; hex 80047d942e; asc   } .;;
    9: SQL NULL;
    10: SQL NULL;
    11: len 9; hex 7363686564756c6564; asc scheduled;;
    12: SQL NULL;
    13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
    14: len 4; hex 804ba58d; asc  K  ;;
    15: len 7; hex 63198bfe0195b2; asc c      ;;
    16: len 7; hex 63162b10000000; asc c +    ;;
    17: len 7; hex 63163920000000; asc c 9    ;;
    18: len 4; hex 80000001; asc     ;;
   
   
   *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
   RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413893066 lock_mode X locks rec but not gap waiting
   Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
    0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes);
    1: len 8; hex 747261696e5f4d59; asc xx;;
    2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes);
    3: len 4; hex 7fffffff; asc     ;;
    4: len 6; hex 000107166d96; asc     m ;;
    5: len 7; hex 01000000dc188c; asc        ;;
    6: len 7; hex 631989a00430e4; asc c    0 ;;
    7: SQL NULL;
    8: SQL NULL;
    9: len 7; hex 72756e6e696e67; asc running;;
    10: len 4; hex 80000002; asc     ;;
    11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes);
    12: len 4; hex 726f6f74; asc root;;
    13: len 4; hex 804e4728; asc  NG(;;
    14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
    15: len 7; hex 64656661756c74; asc default;;
    16: len 4; hex 80000003; asc     ;;
    17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
    18: len 7; hex 6319892f08ff4f; asc c  /  O;;
    19: len 4; hex 80000060; asc    `;;
    20: len 4; hex 80000004; asc     ;;
    21: len 5; hex 80047d942e; asc   } .;;
    22: len 4; hex 80000001; asc     ;;
    23: len 4; hex 804e478a; asc  NG ;;
    24: SQL NULL;
    25: SQL NULL;
    26: SQL NULL;
    27: SQL NULL;
    28: SQL NULL;
   ```
   
   Noticed that the query has the sign of `airflow statistics` so i tried shutdown statsd (delete k8s deployments) blindly and it surprisingly worked. After that the scheduler reboot and run normally.  
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.1 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aaronolszewski commented on issue #26237: Deadlock caused by airflow statistics query

Posted by GitBox <gi...@apache.org>.
aaronolszewski commented on issue #26237:
URL: https://github.com/apache/airflow/issues/26237#issuecomment-1304249390

   @NickYadance I am running into this exact same issue. Has this been fixed in a later version?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] NickYadance closed issue #26237: Deadlock caused by airflow statistics query

Posted by GitBox <gi...@apache.org>.
NickYadance closed issue #26237: Deadlock caused by airflow statistics query
URL: https://github.com/apache/airflow/issues/26237


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] NickYadance commented on issue #26237: Deadlock caused by airflow statistics query

Posted by GitBox <gi...@apache.org>.
NickYadance commented on issue #26237:
URL: https://github.com/apache/airflow/issues/26237#issuecomment-1304996812

   No and the issue description is not actually accurate. The `statistics query` mainly indicates that the query has been waiting  
   for a long time for something, not 100% deadlock between transactions, maybe due to slow query, high mysql server IO ...
   
   For me , i resolve this by optimizing the IO load of the mysql server.  For the daedlock, maybe you can find some help here #27000 with enough debugging info. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org