You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "tanvn (via GitHub)" <gi...@apache.org> on 2023/03/05 11:11:02 UTC

[GitHub] [airflow] tanvn commented on issue #27296: Task completes work but ends up failing due to a lock wait timeout exceeded error and does not honor retries

tanvn commented on issue #27296:
URL: https://github.com/apache/airflow/issues/27296#issuecomment-1455059928

   Hi, may I know if there is anything else that should be taken into account regarding this issue?
   Actually we are running our Airflow (2.5.0) environment with the KubernetesExecutor and lock wait timeouts are happening quite often too.
   We are setting:
   - parallelism = 1024
   - max_active_tasks_per_dag = 16 (for some DAG, it is 32)
   - max_active_runs_per_dag = 8
   - max_tis_per_query = 1024
   
   If setting parallelism to a high value is considered an issue, may I know the reason why?
   We are using MySQL 8, transaction_isolation = REPEATABLE-READ, innodb_lock_wait_timeout=50s.
   One difference is that after failing, the tasks are being retried.
   We have not tried disabling `schedule_after_task_execution` yet, but as far as I notice,  lock wait timeouts are happening in various cases, not only when running mini scheduler, but also when it tries deleting old records of `rendered_task_instance_fields` table. A related issue: https://github.com/apache/airflow/issues/29687
   (however, we are not hitting deadlock errors but timeouts instead
   ```
   sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
   [SQL: DELETE FROM rendered_task_instance_fields WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND ((rendered_task_instance_fields.dag_id, rendered_task_instance_fields.task_id, rendered_task_instance_fields.run_id) NOT IN (SELECT anon_1.dag_id, anon_1.task_id, anon_1.run_id 
   FROM (SELECT DISTINCT rendered_task_instance_fields.dag_id AS dag_id, rendered_task_instance_fields.task_id AS task_id, rendered_task_instance_fields.run_id AS run_id, dag_run.execution_date AS execution_date 
   FROM rendered_task_instance_fields INNER JOIN dag_run ON rendered_task_instance_fields.dag_id = dag_run.dag_id AND rendered_task_instance_fields.run_id = dag_run.run_id 
   WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s ORDER BY dag_run.execution_date DESC 
    LIMIT %s) AS anon_1))]
   [parameters: ('my_dag', 'my_task', 'my_dag', 'my_task', 30)]
   ```
   Currently we have been running a crontab to kill long-sleeping connections to MySQL from Airflow and it seems to work but we have not found the root cause yet.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org