Posted to commits@airflow.apache.org by "Trevor Edwards (JIRA)" <ji...@apache.org> on 2018/08/31 17:08:00 UTC

[jira] [Commented] (AIRFLOW-2511) Subdag failed by scheduler deadlock

    [ https://issues.apache.org/jira/browse/AIRFLOW-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599024#comment-16599024 ] 

Trevor Edwards commented on AIRFLOW-2511:
-----------------------------------------

Doing some digging into this issue, it seems like this kind of error should only happen if sessions are being misused: http://docs.sqlalchemy.org/en/latest/faq/sessions.html#this-session-s-transaction-has-been-rolled-back-due-to-a-previous-exception-during-flush-or-similar
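The misuse pattern from that FAQ entry is easy to reproduce. Below is a minimal sketch (my own illustration, not Airflow code; the `Item` model and in-memory SQLite engine are assumptions for the demo) that swallows a flush failure and then reuses the session, producing the same "transaction has been rolled back" error as the scheduler log:

```python
# Sketch: swallowing a flush exception without rollback() leaves the
# session unusable, raising the error seen in the scheduler log.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.exc import IntegrityError, InvalidRequestError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Item(Base):  # hypothetical model, just for the demo
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# Seed one row in a separate session so the identity map is clean.
with Session(engine) as setup:
    setup.add(Item(id=1))
    setup.commit()

session = Session(engine)
session.add(Item(id=1))  # duplicate primary key -> flush will fail,
try:                     # standing in for the MySQL deadlock (1213)
    session.commit()
except IntegrityError:
    pass  # swallowed WITHOUT session.rollback() -- the broad-except pattern

caught = None
try:
    session.commit()  # any later use of the session now fails
except InvalidRequestError as exc:
    caught = exc  # "This Session's transaction has been rolled back ..."
```

In newer SQLAlchemy versions the second failure is a `PendingRollbackError`, a subclass of `InvalidRequestError`, so the catch above works either way.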

 

Reviewing 1.10 (1.9.0 is similar), https://github.com/apache/incubator-airflow/blob/1.10.0/airflow/models.py#L1602-L1673, there are several calls to methods that call session.commit() but do not catch exceptions and roll back. The broad except at the end of the highlighted region also calls a method that does a session.commit(), so if a SQL exception is swallowed on the way to that line, we end up with exactly this InvalidRequestError. I'm not convinced this is what caused this specific issue, but it is at least related and probably a bug in its own right. I'll keep digging.
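The missing guard could look roughly like this. This is a hedged sketch, not Airflow's actual code; the helper name `retry_commit` and the retry count are my own, and the point is only that rolling back on an OperationalError (e.g. MySQL deadlock 1213, which is retryable) keeps the session usable:

```python
# Sketch of the guard the comment suggests: any helper that calls
# session.commit() should roll back on a DB error so the session is not
# left in the "transaction has been rolled back" state.
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import Session

def retry_commit(session, do_work, retries=3):
    """Run do_work(session) and commit, rolling back and retrying on
    OperationalError (e.g. MySQL error 1213, a retryable deadlock)."""
    last_exc = None
    for _ in range(retries):
        try:
            do_work(session)
            session.commit()
            return
        except OperationalError as exc:
            session.rollback()  # leaves the session clean for the retry
            last_exc = exc
    raise last_exc

# Demo against an in-memory SQLite engine (an assumption for the sketch).
engine = create_engine("sqlite://")
session = Session(engine)
results = []
retry_commit(session, lambda s: results.append(s.execute(text("SELECT 1")).scalar()))
```

Without the rollback() in the except clause, a single deadlock would poison the session for every later commit, which matches the traceback in the issue.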

> Subdag failed by scheduler deadlock
> -----------------------------------
>
>                 Key: AIRFLOW-2511
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2511
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Yohei Onishi
>            Priority: Major
>
> I am using a subdag, and sometimes the main dag is marked failed because of the following error. When this happens, tasks in the subdag stop.
> {code:java}
> hourly_dag = DAG(
>   hourly_dag_name,
>   default_args=dag_default_args,
>   params=dag_custom_params,
>   schedule_interval=config_values.hourly_job_interval,
>   max_active_runs=2)
> hourly_subdag = SubDagOperator(
>   task_id='s3_to_hive',
>   subdag=LoadFromS3ToHive(
>     hourly_dag,
>     's3_to_hive'),
>   dag=hourly_dag)
> {code}
> I got this error in the main dag. Is this a bug in the scheduler?
> {code:java}
> [2018-05-22 21:52:19,683] {models.py:1595} ERROR - This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'UPDATE task_instance SET state=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s'] [parameters: ('queued', 'transfer_from_tmp_table_into_cleaned_table', 'rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h.s3_to_hive', datetime.datetime(2018, 5, 7, 5, 2))] (Background on this error at: http://sqlalche.me/e/e3q8)
> Traceback (most recent call last):
> sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'UPDATE task_instance SET state=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s'] [parameters: ('queued', 'transfer_from_tmp_table_into_cleaned_table', 'rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h.s3_to_hive', datetime.datetime(2018, 5, 7, 5, 2))] (Background on this error at: http://sqlalche.me/e/e3q8)
> [2018-05-22 21:52:19,687] {models.py:1624} INFO - Marking task as FAILED.
> [2018-05-22 21:52:19,688] {base_task_runner.py:98} INFO - Subtask: [2018-05-22 21:52:19,688] {slack_hook.py:143} INFO - Message is prepared: 
> [2018-05-22 21:52:19,688] {base_task_runner.py:98} INFO - Subtask: {"attachments": [{"color": "danger", "text": "", "fields": [{"title": "DAG", "value": "<https://data-platform-airflow-tokyo.data.fastretailing.com:8080/admin/airflow/tree?dag_id=rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h|rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h>", "short": true}, {"title": "Owner", "value": "airflow", "short": true}, {"title": "Task", "value": "s3_to_hive", "short": false}, {"title": "Status", "value": "FAILED", "short": false}, {"title": "Execution Time", "value": "2018-05-07T05:02:00", "short": true}, {"title": "Duration", "value": "826.305929", "short": true}, {"value": "<https://data-platform-airflow-tokyo.data.fastretailing.com:8080/admin/airflow/log?dag_id=rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h&task_id=s3_to_hive&execution_date=2018-05-07T05:02:00|View Task Log>", "short": false}]}]}
> [2018-05-22 21:52:19,688] {models.py:1638} ERROR - Failed at executing callback
> [2018-05-22 21:52:19,688] {models.py:1639} ERROR - This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (_mysql_exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'UPDATE task_instance SET state=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s'] [parameters: ('queued', 'transfer_from_tmp_table_into_cleaned_table', 'rfid_warehouse_carton_wh_g_dl_dwh_csv_uqjp_1h.s3_to_hive', datetime.datetime(2018, 5, 7, 5, 2))] (Background on this error at: http://sqlalche.me/e/e3q8)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)