Posted to commits@airflow.apache.org by ur...@apache.org on 2021/08/11 14:30:32 UTC

[airflow] branch aip-39-docs created (now bce29bc)

This is an automated email from the ASF dual-hosted git repository.

uranusjr pushed a change to branch aip-39-docs
in repository https://gitbox.apache.org/repos/asf/airflow.git.


      at bce29bc  WIP

This branch includes the following new commits:

     new bce29bc  WIP

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  Revisions
listed as "add" were already present in the repository and have only
been added to this reference.


[airflow] 01/01: WIP


uranusjr pushed a commit to branch aip-39-docs
in repository https://gitbox.apache.org/repos/asf/airflow.git

commit bce29bc40eaece4cb3d2a569a4f1984898867f5d
Author: Tzu-ping Chung <tp...@astronomer.io>
AuthorDate: Wed Aug 11 22:29:08 2021 +0800

    WIP
---
 docs/apache-airflow/concepts/dags.rst   | 14 ++++++--
 docs/apache-airflow/dag-run.rst         | 32 ++++++++++++-----
 docs/apache-airflow/howto/index.rst     |  1 +
 docs/apache-airflow/howto/timetable.rst | 63 +++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/docs/apache-airflow/concepts/dags.rst b/docs/apache-airflow/concepts/dags.rst
index c564ef8..acbf24c 100644
--- a/docs/apache-airflow/concepts/dags.rst
+++ b/docs/apache-airflow/concepts/dags.rst
@@ -148,14 +148,24 @@ The ``schedule_interval`` argument takes any value that is a valid `Crontab <htt
     with DAG("my_daily_dag", schedule_interval="0 0 * * *"):
         ...
 
-Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined ``execution_date``, which identifies the *logical* date and time it is running for - not the *actual* time when it was started.
+.. tip::
+
+    For more information on ``schedule_interval`` values, see :doc:`DAG Run </dag-run>`.
+
+    If ``schedule_interval`` is not enough to express the DAG's schedule, see :doc:`Timetables </howto/timetable>`.
+
+Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a :doc:`DAG Run </dag-run>`. DAG Runs can run in parallel for the same DAG, and each has a defined data interval, which identifies the *logical* date and time range it is running for - not the *actual* time when it was started.
 
 As an example of why this is useful, consider writing a DAG that processes a daily set of experimental data. It's been rewritten, and you want to run it on the previous 3 months of data - no problem, since Airflow can *backfill* the DAG and run copies of it for every day in those previous 3 months, all at once.
 
-Those DAG Runs will all have been started on the same actual day, but their ``execution_date`` values will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
+Those DAG Runs will all have been started on the same actual day, but their data intervals will cover those last 3 months, and that's what all the tasks, operators and sensors inside the DAG look at when they run.
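The backfill behaviour described above can be sketched in plain Python (an illustration only, independent of Airflow's actual backfill machinery; the dates are hypothetical):

```python
import datetime

# Illustrative sketch: a daily backfill over three months creates one run
# per day, each identified by its own data interval, regardless of the
# actual day the backfill itself is started.
start = datetime.date(2021, 5, 1)
end = datetime.date(2021, 8, 1)

intervals = []
day = start
while day < end:
    # Each run's data interval covers one full day.
    intervals.append((day, day + datetime.timedelta(days=1)))
    day += datetime.timedelta(days=1)
```

Every task in each of those runs sees its own interval, not the wall-clock time the backfill ran.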
 
 In much the same way a DAG instantiates into a DAG Run every time it's run, Tasks specified inside a DAG also instantiate into :ref:`Task Instances <concepts:task-instances>` along with it.
 
+.. seealso::
+
+    :doc:`Data Intervals <./data-interval>`
+
 
 DAG Assignment
 --------------
diff --git a/docs/apache-airflow/dag-run.rst b/docs/apache-airflow/dag-run.rst
index 5d47a0b..6bbe5e0 100644
--- a/docs/apache-airflow/dag-run.rst
+++ b/docs/apache-airflow/dag-run.rst
@@ -54,17 +54,31 @@ Cron Presets
 Your DAG will be instantiated for each schedule along with a corresponding
 DAG Run entry in the database backend.
 
-.. note::
+Data Interval
+-------------
+
+Each DAG run in Airflow has an assigned "data interval" that represents the time
+range it operates in. For a DAG scheduled with ``@daily``, for example, each
+of its data intervals starts at midnight of one day and ends at midnight of
+the next day.
+
+A DAG run happens *after* its associated data interval has ended, to ensure the
+run is able to collect all the actual data within the time period. Therefore, a
+run covering the data period of 2020-01-01 will not start until
+2020-01-01 has ended, i.e. from 2020-01-02 onwards.
+
+All dates in Airflow are tied to the data interval concept in some way. The
+"logical date" (also called ``execution_date`` in previous Airflow versions)
+of a DAG run, for example, usually denotes the start of the data interval, not
+when the DAG is actually executed. Similarly, since the ``start_date``
+argument for the DAG and its tasks refers to the logical date, the first run
+will only be created after its data interval ends. So a DAG with a ``@daily``
+schedule and a ``start_date`` of 2020-01-01, for example, will not have its
+first run created until 2020-01-02.
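The relationship between logical date, data interval, and run creation described above can be sketched in plain Python (an illustration only, not Airflow's actual implementation):

```python
import datetime

# Illustrative sketch: for a @daily schedule, the logical date marks the
# start of the data interval, and the run is only created once the
# interval has ended.
def daily_interval(logical_date: datetime.datetime):
    start = logical_date                      # logical date == interval start
    end = start + datetime.timedelta(days=1)  # interval ends at the next midnight
    return start, end

start, end = daily_interval(datetime.datetime(2020, 1, 1))
run_created_at = end  # the run covering 2020-01-01 is created on 2020-01-02
```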
 
-    If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01
-    will be triggered soon after 2020-01-01T23:59. In other words, the job instance is
-    started once the period it covers has ended.  The ``execution_date`` available in the context
-    will also be 2020-01-01.
+.. tip::
 
-    The first DAG Run is created based on the minimum ``start_date`` for the tasks in your DAG.
-    Subsequent DAG Runs are created by the scheduler process, based on your DAG’s ``schedule_interval``,
-    sequentially. If your start_date is 2020-01-01 and schedule_interval is @daily, the first run
-    will be created on 2020-01-02 i.e., after your start date has passed.
+    If ``schedule_interval`` is not enough to express your DAG's schedule,
+    logical date, or data interval, see :doc:`Customizing Timetables </howto/timetable>`.
 
 Re-run DAG
 ''''''''''
diff --git a/docs/apache-airflow/howto/index.rst b/docs/apache-airflow/howto/index.rst
index efd5c48..9fb80fb 100644
--- a/docs/apache-airflow/howto/index.rst
+++ b/docs/apache-airflow/howto/index.rst
@@ -33,6 +33,7 @@ configuring an Airflow environment.
     set-config
     set-up-database
     operator/index
+    timetable
     customize-state-colors-ui
     customize-dag-ui-page-instance-name
     custom-operator
diff --git a/docs/apache-airflow/howto/timetable.rst b/docs/apache-airflow/howto/timetable.rst
new file mode 100644
index 0000000..2c9ebb3
--- /dev/null
+++ b/docs/apache-airflow/howto/timetable.rst
@@ -0,0 +1,63 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+Customizing DAG Scheduling with Timetables
+==========================================
+
+A DAG's scheduling strategy is determined by its internal "timetable". This
+timetable can be created by specifying the DAG's ``schedule_interval`` argument,
+as described in :doc:`DAG Run </dag-run>`. The timetable also dictates the data
+interval and the logical time of each run created for the DAG.
+
+However, there are situations when a cron expression or simple ``timedelta``
+periods cannot properly express the schedule. Some of the examples are:
+
+* Data intervals with "holes" in between. (Instead of continuous, as both the
+  cron expression and ``timedelta`` schedules represent.)
+* Run tasks at different times each day. For example, an astronomer may find it
+  useful to run a task on each sunset, to process data collected from the
+  previous sunlight period.
+* Schedules not following the Gregorian calendar. For example, create a run for
+  each month in the `Traditional Chinese Calendar`_. This is conceptually
+  similar to the sunset case above, but for a different time scale.
+* Rolling windows, or overlapping data intervals. For example, one may want to
+  have a run each day, but make each run cover the period of the previous seven
+  days. It is possible to "hack" this with a cron expression, but a custom
+  data interval would make the task specification more natural.
+
+.. _`Traditional Chinese Calendar`: https://en.wikipedia.org/wiki/Chinese_calendar
+
+
+For our example, let's say a company wants to run a job after each weekday,
+to process data collected during the work day. The first intuitive answer
+to this would be ``schedule_interval="0 0 * * 1-5"`` (midnight on Monday to
+Friday), but this means data collected on Friday will *not* be processed right
+after Friday, but on the next Monday, and that run's interval would be from
+midnight Friday to midnight *Monday*.
+
+This is, therefore, a case of the "holes" category; the intended schedule
+should leave out the two weekend days. What we want is:
+
+* Schedule a run for each Monday, Tuesday, Wednesday, Thursday, and Friday. The
+  run's data interval would cover from the midnight of each day, to the midnight
+  of the next day.
+* Each run would be created right after the data interval ends. The run covering
+  Monday happens on midnight Tuesday and so on. The run covering Friday happens
+  on midnight Saturday. No runs happen on midnights Sunday and Monday.
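The two bullet points above can be sketched in plain Python (a hypothetical helper for illustration only, not the timetable API this how-to will eventually describe):

```python
import datetime

# Illustrative sketch: given the start of the previous data interval,
# compute the next weekday interval, skipping Saturday and Sunday so no
# interval ever covers a weekend day.
def next_weekday_interval(prev_start: datetime.date):
    start = prev_start + datetime.timedelta(days=1)
    while start.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        start += datetime.timedelta(days=1)
    # The run for this interval would be created at its end (the next midnight).
    return start, start + datetime.timedelta(days=1)

# After Friday's interval, the next interval covers Monday
# (2021-08-13 is a Friday; the dates here are hypothetical).
start, end = next_weekday_interval(datetime.date(2021, 8, 13))
```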
+
+TODO...