You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/03/05 11:20:55 UTC

[GitHub] [airflow] potiuk commented on issue #27296: Task completes work but ends up failing due to a lock wait timeout exceeded error and does not honor retries

potiuk commented on issue #27296:
URL: https://github.com/apache/airflow/issues/27296#issuecomment-1455062224

   > Hi, may I know if there is anything else that should be taken into account regarding this issue? Actually we are running our Airflow (2.5.0) environment with the KubernetesExecutor and lock wait timeouts are happening quite often too. We are setting:
   > 
   > * parallelism = 1024
   > * max_active_tasks_per_dag = 16 (for some DAG, it is 32)
   > * max_active_runs_per_dag = 8
   > * max_tis_per_query = 1024
   > 
   > If setting parallelism to a high value is considered an issue, may I know the reason why? We are using MySQL 8, transaction_isolation = REPEATABLE-READ, innodb_lock_wait_timeout=50s. One difference is that after failing, the tasks are being retried. We have not tried disabling `schedule_after_task_execution` yet, but as far as I notice, lock wait timeouts are happening in various cases, not only when running mini scheduler, but also when it tries deleting old records of `rendered_task_instance_fields` table. A related issue: #29687 (however, we are not hitting deadlock errors but timeouts instead
   > 
   > ```
   > sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
   > [SQL: DELETE FROM rendered_task_instance_fields WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND ((rendered_task_instance_fields.dag_id, rendered_task_instance_fields.task_id, rendered_task_instance_fields.run_id) NOT IN (SELECT anon_1.dag_id, anon_1.task_id, anon_1.run_id 
   > FROM (SELECT DISTINCT rendered_task_instance_fields.dag_id AS dag_id, rendered_task_instance_fields.task_id AS task_id, rendered_task_instance_fields.run_id AS run_id, dag_run.execution_date AS execution_date 
   > FROM rendered_task_instance_fields INNER JOIN dag_run ON rendered_task_instance_fields.dag_id = dag_run.dag_id AND rendered_task_instance_fields.run_id = dag_run.run_id 
   > WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s ORDER BY dag_run.execution_date DESC 
   >  LIMIT %s) AS anon_1))]
   > [parameters: ('my_dag', 'my_task', 'my_dag', 'my_task', 30)]
   > ```
   > 
   > Currently we have been running a crontab to kill long-sleeping connections to MySQL from Airflow and it seems to work but we have not found the root cause yet.
   
   @tanvn Yes I tihnk #29687 is the right one. I think what could help temporarily is to set the `max_num_rendered_ti_fields_per_task` to 0 . We realize that this query might be problematic on mysql especially and we are discussing with @Taragolis and @kaxil the right way of solving it. 
   
   The side effect of setting this value to 0 is that if you have a lot of rendered task instance fields and their size is significant, you might end up with quickly growing size of the Database. But you can in modern airflow to periodically run `airflow db clean` cleaning that table.
   
   Can you please set it and observe your DB and let us know - ideally in #29687 the db size growth and whether it fixed your problems ?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org