Posted to commits@airflow.apache.org by "Eric Nichols (Jira)" <ji...@apache.org> on 2019/10/31 00:23:00 UTC
[jira] [Updated] (AIRFLOW-5820) Long delay between individual tasks in a large backfill
[ https://issues.apache.org/jira/browse/AIRFLOW-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Nichols updated AIRFLOW-5820:
----------------------------------
Description:
I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 seconds to fill the DagBag, and the task takes only 3 seconds to run.
Max concurrency must be set to 1 since my task hits a public API with a rate limit in effect.
I set it up to backfill 3 years of data, so I need to run the task ~1000 times in sequence. This should take ~3000 seconds.
Unfortunately, Airflow spends 3 seconds running the task and then waits around 40 seconds before starting the next day of the backfill. So more than 90% of the time is spent waiting, and the job takes ~10x longer than required.
I think there should be a way to make backfill jobs run quickly, one after another, in the very simple case I have described. There is simply not 40 seconds' worth of necessary compute to do between tasks.
was:
I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 seconds to fill the DagBag, and the task takes only 3 seconds to run.
Max concurrency must be set to 1 since my task hits a public API with a rate limit in effect.
I set it up to backfill 3 years of data; so I need to run the task ~1000 times in sequence. This should take ~3000 seconds.
Unfortunately, Airflow spends 3 seconds running task, and then waits around 40 seconds before starting the next day of the backfill. So more than 90% of the time is Airflow spinning, and the job takes ~10x longer than required.
I think there should be a way to make backfill jobs run quickly, one after another, in this very simple case I have described. There is simply not 40 seconds worth of necessary compute to do between tasks.
> Long delay between individual tasks in a large backfill
> -------------------------------------------------------
>
> Key: AIRFLOW-5820
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5820
> Project: Apache Airflow
> Issue Type: Improvement
> Components: backfill
> Affects Versions: 1.10.5
> Environment: Ubuntu 18
> Reporter: Eric Nichols
> Priority: Major
>
> I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 seconds to fill the DagBag, and the task takes only 3 seconds to run.
> Max concurrency must be set to 1 since my task hits a public API with a rate limit in effect.
> I set it up to backfill 3 years of data, so I need to run the task ~1000 times in sequence. This should take ~3000 seconds.
> Unfortunately, Airflow spends 3 seconds running the task and then waits around 40 seconds before starting the next day of the backfill. So more than 90% of the time is spent waiting, and the job takes ~10x longer than required.
> I think there should be a way to make backfill jobs run quickly, one after another, in the very simple case I have described. There is simply not 40 seconds' worth of necessary compute to do between tasks.
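The timing figures in the description can be checked with a short back-of-the-envelope sketch. The numbers below are the approximate ones quoted in the report (~1000 runs, 3 seconds of work, ~40 seconds of waiting). The function name backfill_estimate is hypothetical and is not part of Airflow.

```python
# Rough check of the figures quoted in the report (all inputs are approximate).
RUNS = 1000          # "~1000" sequential runs for 3 years of daily data
TASK_SECONDS = 3     # time spent actually running the task
IDLE_SECONDS = 40    # observed wait before the next run starts


def backfill_estimate(runs, task_seconds, idle_seconds):
    """Return (ideal_total, actual_total, idle_fraction, slowdown)."""
    ideal_total = runs * task_seconds
    actual_total = runs * (task_seconds + idle_seconds)
    idle_fraction = idle_seconds / (task_seconds + idle_seconds)
    slowdown = actual_total / ideal_total
    return ideal_total, actual_total, idle_fraction, slowdown


ideal, actual, idle_fraction, slowdown = backfill_estimate(
    RUNS, TASK_SECONDS, IDLE_SECONDS
)
print(f"ideal: {ideal} s, actual: {actual} s")
print(f"idle fraction: {idle_fraction:.1%}, slowdown: {slowdown:.1f}x")
```

With these inputs the ideal total is 3000 seconds and the total with waiting is 43000 seconds. The idle share is about 93%, which matches "more than 90%", and the slowdown is about 14x, the same order of magnitude as the reported ~10x.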
--
This message was sent by Atlassian Jira
(v8.3.4#803005)