Posted to commits@airflow.apache.org by "Bas Harenslak (JIRA)" <ji...@apache.org> on 2019/01/31 22:25:00 UTC

[jira] [Created] (AIRFLOW-3797) Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance migration

Bas Harenslak created AIRFLOW-3797:
--------------------------------------

             Summary: Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance migration
                 Key: AIRFLOW-3797
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3797
             Project: Apache Airflow
          Issue Type: Improvement
            Reporter: Bas Harenslak


The cc1e65623dc7_add_max_tries_column_to_task_instance migration creates a DagBag for the corresponding DAG for every single task instance. This is highly redundant: the same DAG is re-parsed once per task instance, which makes the migration extremely slow on databases with many task instances.
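
For illustration, a minimal sketch of that pattern (hypothetical, not the actual migration source; the session setup is assumed):

{noformat}
# Hypothetical sketch of the pattern described above, NOT the actual
# migration code.
from airflow import settings
from airflow.models import DagBag, TaskInstance

session = settings.Session()
for ti in session.query(TaskInstance).all():
    # A fresh DagBag is constructed for every task instance, so the
    # same DAG file is parsed once per task_instance row.
    dag = DagBag(settings.DAGS_FOLDER).get_dag(ti.dag_id)
    if dag and dag.has_task(ti.task_id):
        ti.max_tries = dag.get_task(ti.task_id).retries
session.commit()
{noformat}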

Hence, there have been discussions on Slack like this one:

{noformat}
murquizo   [Jan 17th at 1:33 AM]
Why does the airflow upgradedb command loop through all of the dags?

....

murquizo   [14 days ago]
NICE, @BasPH! that is exactly the migration that I was referring to.  We have about 600k task instances and have a several python files that generate multiple DAGs, so looping through all of the task_instances to update max_tries was too slow.  It took 3 hours and didnt even complete! i pulled the plug and manually executed the migration.   Thanks for your response.
{noformat}


An easy improvement is to parse each DAG only once and then set try_number on all of its task instances. I created a branch for it (https://github.com/BasPH/incubator-airflow/tree/bash-optimise-db-upgrade), am currently running tests, and will open a PR when done.
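
A minimal sketch of that idea (hypothetical, not the code on the branch):

{noformat}
# Hypothetical sketch of the proposed improvement, not the branch code.
from airflow import settings
from airflow.models import DagBag, TaskInstance

session = settings.Session()
dagbag = DagBag(settings.DAGS_FOLDER)  # parse the DAG files a single time
dags = {}  # cache: dag_id -> parsed DAG

for ti in session.query(TaskInstance).all():
    if ti.dag_id not in dags:
        dags[ti.dag_id] = dagbag.get_dag(ti.dag_id)
    dag = dags[ti.dag_id]
    if dag and dag.has_task(ti.task_id):
        # Backfill from the task's configured retries; the real
        # migration's bookkeeping may differ.
        ti.max_tries = dag.get_task(ti.task_id).retries
    else:
        ti.max_tries = ti.try_number
session.commit()
{noformat}

On large databases the query itself should probably also be processed in batches rather than loading the entire task_instance table into memory at once.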


