You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "t oo (Jira)" <ji...@apache.org> on 2019/12/25 22:02:00 UTC

[jira] [Updated] (AIRFLOW-5866) Task_instance table too large causing issues?

     [ https://issues.apache.org/jira/browse/AIRFLOW-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

t oo updated AIRFLOW-5866:
--------------------------
    Affects Version/s: 1.10.6

> Task_instance table too large causing issues?
> ---------------------------------------------
>
>                 Key: AIRFLOW-5866
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5866
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: database, scheduler
>    Affects Versions: 1.10.3, 1.10.6
>            Reporter: t oo
>            Priority: Major
>         Attachments: EC2's `ps -ef pipe wc -l` count.png, Mysql queuedepth.png, mysql cpu_util.png, mysql dbconnections.png, mysql read latency.png, mysql write IOPS.png, mysql write latency.png
>
>
> mysql rds metastore - db.m5.large instance, 5.7.26 version
>  
> task_instance table has 2,848,160 rows
> dag_run table has 22768 rows
> dag table has 23 rows
> log table has 17,916,891 rows
>  
> airflow 1.10.3, using LocalExecutor, python 2.7, single ec2 m5.4xlarge, parallelism set to 45 (ie max 45 tasks at once). Just using externally triggered dags, no SLAs. No subdags/backfills. 4 gunicorn workers. Using dynamic dags
>  
> Everything was fine until yesterday, around 300 dag runs every day. Now today these issues appear all of a sudden (no code change, environment change.etc). I suspect the task_instance table has gotten too big and causing scheduler and mysql issues.
>  
> 1.
> 'Recent tasks' are showing blank on the web ui home page. admin/airflow/task_stats fails to display with 504 error after few mins but dag_stats endpoint shows dags are in running state
>  
> 2.
> dag_runs are stuck in running state > 20 hrs, seems no new tasks are being run (they are stuck in scheduled/queued state)
>  
> I then tried terminating the EC2 and getting a new one, the dagruns and tasks would then start finishing but then after few hours got into same situation as points 1/2 above. I believe certain dag ids (with many tasks) are hitting the issue, will know m
>  
>  
> You can see notable change in the graphs just after 4th november midnight (that is when the issue started). Around 30 dagruns (yes there are diff execution_dates running for same dagid at same time) start around 11pm each night.
>  
> Scheduler/webserver pids have remained up the entire time, no ec2 autoheals happened
>  
>  
> scheduler log shows:
> [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 29/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 30/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 31/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 32/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 33/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 34/45 running and queued tasks
>  [2019-11-07 12:25:18,287] \{jobs.py:1185} INFO - DAG daga has 35/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,435] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 37/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 38/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 39/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 40/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 41/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 42/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 43/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 44/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:21,436] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:21,437] \{jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,193] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:42,194] \{jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
>  [2019-11-07 12:25:49,493] \{jobs.py:1185} INFO - DAG dagc has 0/45 running and queued tasks
>  [2019-11-07 12:25:49,493] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,494] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,495] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-07 12:25:49,496] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
> [2019-11-06 11:53:08,813] \{jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
>  [2019-11-06 11:53:08,813] \{jobs.py:1191} INFO - Not executing <TaskInstance: task1 2019-11-03 00:00:00+00:00 [scheduled]> since the number of tasks running or queued from DAG daga is >= to the DAG's task concurrency limit of 45
>  
>  
> select state,count(1) from airflowdb.task_instance --not mysql syntax (querying via Presto)
> where dag_id = 'daga' and execution_date in ( timestamp'2019-11-03 00:00:00',timestamp'2019-11-04 00:00:00',timestamp'2019-11-06 00:00:00',timestamp'2019-11-07 00:00:00') --these are the dates in the dag_run table with state of running
> group by state
> success 871
> failed 34
> upstream_failed 68
> skipped 568
> scheduled 486
> <null> 573
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)