Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/11/27 12:30:49 UTC

[GitHub] [airflow] jward-bw opened a new issue #12654: RFC: Merging backfills

jward-bw opened a new issue #12654:
URL: https://github.com/apache/airflow/issues/12654


   **Description**
   Make it possible to merge multiple backfills into a single run, by extending the `start_date` of a single dagrun to cover a time period inclusive of all backfills. 
   
   **Use case / motivation**
   There are cases where running multiple backfills is less efficient than having a single run, for example where tasks in successive runs would do duplicate work. 
   
   ***An example:***
   
   - We have a dag which runs every 6 hours and processes batches of messages from the previous 6 hours by looking at the `execution_date` and `next_execution_date` macros.
   - This dag has a task which launches a scan across a very large HBase table looking for matching rows to apply these messages to. The scan takes the same amount of time regardless of the batch size. The scan is the most time-consuming part of the dagrun (let's say it takes 3 out of 4 hours for an average dagrun).
   - An external error causes 3 successive dagruns to fail.
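
   The combined data window left behind by those three failed runs can be worked out as in the sketch below. The dates are invented for illustration, and `merged_window` is a hypothetical helper, not an Airflow API. It only assumes that each run covers the interval from its `execution_date` up to the next one.

```python
from datetime import datetime, timedelta


def merged_window(execution_dates, interval):
    """Return one (start, end) window covering consecutive scheduled runs.

    Hypothetical helper: assumes each run covers
    [execution_date, execution_date + interval).
    """
    ordered = sorted(execution_dates)
    if not ordered:
        raise ValueError("no runs to merge")
    # Refuse to merge across a gap, since that would pull in data
    # belonging to runs that are not part of the failed set.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier != interval:
            raise ValueError("runs are not consecutive")
    return ordered[0], ordered[-1] + interval


# Three failed 6-hourly runs (illustrative dates only).
failed = [
    datetime(2020, 11, 26, 0),
    datetime(2020, 11, 26, 6),
    datetime(2020, 11, 26, 12),
]
start, end = merged_window(failed, timedelta(hours=6))
print(start, end)   # 2020-11-26 00:00:00 2020-11-26 18:00:00
print(end - start)  # 18:00:00
```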
   
   At this point we have 18 hours of data to catch up on. Assuming the external issue has been fixed, this would take on average 12 hours to process (three runs at about 4 hours each), meaning further delays to processing future jobs. If instead we could merge these runs into a single backfill, this would reduce the processing time from 12 hours to something like 6 hours (one 3-hour scan plus roughly an hour of per-batch work for each of the three batches), greatly reducing the impact of delayed processing and also the resource usage on Airflow and HBase (in this case, but more generally on any external service).
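
   The rough numbers above can be reproduced with a simple cost model. This is a sketch under stated assumptions, not a measurement: the scan is a fixed 3-hour cost per run, and the remaining hour of an average run is assumed to scale linearly with the number of 6-hour batches. Both function names are made up for the example.

```python
def separate_runs_hours(n_runs, scan_hours, batch_hours):
    """Total time when every run pays for its own full table scan."""
    return n_runs * (scan_hours + batch_hours)


def merged_run_hours(n_runs, scan_hours, batch_hours):
    """Total time when one scan is shared by all batches.

    Assumption: the non-scan work grows linearly with the number of batches.
    """
    return scan_hours + n_runs * batch_hours


# Figures from the example: a 4-hour average run, 3 hours of which is the scan.
SCAN_HOURS = 3
BATCH_HOURS = 1
FAILED_RUNS = 3

separate = separate_runs_hours(FAILED_RUNS, SCAN_HOURS, BATCH_HOURS)
merged = merged_run_hours(FAILED_RUNS, SCAN_HOURS, BATCH_HOURS)
print(separate, merged)  # 12 6
```

   Under this model the saving grows with the number of runs merged, because only the per-batch work is repeated.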
   
   This issue of inefficient processing is one that I (and I'm sure others) need to solve. There are other workarounds, but I don't think any of them is correct in the sense of Airflow good practice. For example:
   - Temporarily alter the schedule interval to cover the desired range.
   - Introduce an override in the Airflow variables to make the next run process X batches.
   - Temporarily alter the dag code.
   - Run the dag tasks manually, outside Airflow, with the desired parameters.
   
   All of these have their own pitfalls and invariably involve some other manual intervention in Airflow to ensure the database is kept accurate and/or future runs aren't affected.
   
   If there is some other solution to this problem that I am unaware of, please let me know. I have raised this as an RFC as any change that implements this feature would touch many areas of the code base, so would require some planning.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #12654: RFC: Merging backfills

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #12654:
URL: https://github.com/apache/airflow/issues/12654#issuecomment-734813137


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] jward-bw commented on issue #12654: RFC: Merging backfills

Posted by GitBox <gi...@apache.org>.
jward-bw commented on issue #12654:
URL: https://github.com/apache/airflow/issues/12654#issuecomment-734876781


   > This sounds good! I wonder if we could mix it with triggering backfill externally #11302
   
   Yeah, absolutely. I'll raise an AIP for backfill improvements when I get the chance, and we can discuss there what its scope should be.





[GitHub] [airflow] turbaszek commented on issue #12654: RFC: Merging backfills

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #12654:
URL: https://github.com/apache/airflow/issues/12654#issuecomment-734819556


   This sounds good! I wonder if we could mix it with triggering backfill externally #11302





[GitHub] [airflow] turbaszek commented on issue #12654: RFC: Merging backfills

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #12654:
URL: https://github.com/apache/airflow/issues/12654#issuecomment-734907079


   @jward-bw I did a small draft last week so feel free to use it:
   https://docs.google.com/document/d/1q138mGBfr9uEJbe43sTPobj_30vo2g2pIel1rBt9fLg/edit?usp=sharing





[GitHub] [airflow] ashb commented on issue #12654: RFC: Merging backfills

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #12654:
URL: https://github.com/apache/airflow/issues/12654#issuecomment-734816725


   Yes, this would be a good feature!
   
   What most people seem to do right now as a workaround is to have a special "backfill dag" that does the batching.
   
   We (collectively) will need to spend some time designing an interface for this, and then likely raise it as an Airflow Improvement Proposal https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
   
   I'll happily help you with this process.

