Posted to commits@airflow.apache.org by "SamWheating (via GitHub)" <gi...@apache.org> on 2023/03/03 22:08:14 UTC

[GitHub] [airflow] SamWheating opened a new pull request, #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

SamWheating opened a new pull request, #29909:
URL: https://github.com/apache/airflow/pull/29909

   Re: https://github.com/apache/airflow/issues/29900
   
   This introduces a new `@continuous` Timetable which will always try to start new DAGRuns. 
   
   The degree of parallelism can then be bounded with the `max_active_runs` parameter. 
   
   This is a little bit different from the currently available approaches:
    - It doesn't have the notion of catchup that `schedule_interval="* * * * *"` has, and can hypothetically run more often.
    - It's a lot lighter-weight (both in definition and execution) than using a `TriggerDagRunOperator` at the end of a DAG (a sketch of that workaround follows below).
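   
   For contrast, here is a rough sketch of the self-triggering workaround this replaces: a DAG with no schedule whose last task re-triggers it. The DAG and task ids are illustrative and not part of this PR.
   ```python
   from datetime import datetime
   
   from airflow.models import DAG
   from airflow.operators.bash import BashOperator
   from airflow.operators.trigger_dagrun import TriggerDagRunOperator
   
   # The last task re-triggers the same DAG, so it keeps running back-to-back
   # without any time-based schedule.
   with DAG(
       "self_triggering_dag",
       schedule_interval=None,
       start_date=datetime(2021, 1, 1),
       max_active_runs=1,
       catchup=False,
   ) as dag:
       work = BashOperator(task_id="the_work", bash_command="sleep 10")
       retrigger = TriggerDagRunOperator(
           task_id="retrigger_self",
           trigger_dag_id="self_triggering_dag",
       )
       work >> retrigger
   ```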
    
   Tested in Breeze with the following DAG:
   ```python
   from datetime import datetime
   
   from airflow.models import DAG
   from airflow.operators.bash import BashOperator
   
   dag = DAG(
       "continuous_dag",
       schedule_interval="@continuous",
       start_date=datetime(2021, 1, 1),
   )
   
   task = BashOperator(task_id="the_task", dag=dag, bash_command="sleep 10", owner="nobody!")
   ```
   
   And it seemed to work well:
    
   <img width="1893" alt="image" src="https://user-images.githubusercontent.com/16950874/222838739-9c236abb-0fab-49f0-8fb8-c7de18212616.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on PR #29909:
URL: https://github.com/apache/airflow/pull/29909#issuecomment-1456004184

   I think we also need to add it to the docs
   https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/timetable.html#built-in-timetables




[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1133154264


##########
airflow/models/dag.py:
##########
@@ -545,7 +552,10 @@ def __init__(
         self.template_undefined = template_undefined
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
-        self.max_active_runs = max_active_runs
+        if isinstance(self.timetable, ContinuousTimetable):

Review Comment:
   It feels kind of messy to have this sort of conditional override in the `__init__`, but I'm not sure where else to put this important logic 🤔 
   
   Thoughts?





[GitHub] [airflow] potiuk commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125629546


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   > I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses depends_on_past for continuous pipelined execution). Additionally, a similar hazard already exists with schedule_interval="* * * * *" which could create many jobs quite quickly.
   
   I think this is quite a different case, and it is better handled with dynamic tasks (especially once we get "depth-first execution" working) rather than by spawning multiple dag runs. The difference vs. `* * * * *` is that there we know the speed of creating DagRuns will be 1/minute. Full stop. With "continuous" allowing multiple dag runs - what is the speed of creating those? Can we limit it? In the `* * * * *` case, every dag run has its own "per minute" logical date and displaying it in the UI makes perfect sense. In the case of "continuous" we can of course enforce a monotonic logical date somehow, but having 100s of dagruns differing by microseconds sounds strange (and we will very likely have a race problem with multiple schedulers trying to create dag runs with the exact same logical date at the same time).
   
   I think when you try to describe this approach in the docs - explaining to users the intricacies of having more than one "max_active_run" and what they can expect in detail - you will see the problem clearly. For example, you will have to explain that the speed of ramping up such dagruns and the maximum number of parallel runs will depend not only on scheduler parameters https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#max-dagruns-per-loop-to-schedule but also on other DAGs being scheduled at the same time (because they will be competing for the same dag run limit), the speed of the DB, the scheduler, and the number of schedulers you run - all connected in a non-trivial way and next to impossible for the user to control, as it will change dynamically. Contrast this with dynamic task mapping, where you have complete control not only over the maximum number of task instances but hopefully, in the near future, also over how many tasks in each mapped task group can run in parallel: https://github.com/apache/airflow/pull/29094.
   
   Also, those concerns pale in comparison with the very bad UI experience people will have. If people start using continuous to spawn a massive number of parallel dagruns, our UI is (in contrast to dynamic task mapping) absolutely not ready for that. Set max_active_runs to 1000 and try to visualize those dagruns in the current UI. That won't work. No pagination, no way to limit the results and navigate between them, no way to really distinguish one dag run from another. This is - again - in stark contrast to all the investment that @bbovenzi made into making the dynamic task UI navigable and usable when you have 100s and 1000s of dynamically mapped tasks (Task Groups soon, I hope). 
   
   I'd say the only really good use case for continuous is with max_active_runs=1, and we should limit it to only that.
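   
   To make the contrast concrete, here is a rough sketch of the dynamic-task-mapping shape being advocated here, combined with a single continuous run. It assumes the `@continuous` schedule from this PR; the DAG id and batch logic are purely illustrative.
   ```python
   from __future__ import annotations
   
   from datetime import datetime
   
   from airflow.decorators import dag, task
   
   
   @dag(
       dag_id="continuous_with_mapping",   # illustrative
       schedule_interval="@continuous",    # one run at a time, back-to-back
       start_date=datetime(2021, 1, 1),
       max_active_runs=1,
       catchup=False,
   )
   def continuous_with_mapping():
       @task
       def list_batches() -> list[int]:
           # Parallelism happens inside the single active run, via mapped tasks.
           return list(range(10))
   
       @task
       def process(batch: int) -> None:
           print(f"processing batch {batch}")
   
       process.expand(batch=list_batches())
   
   
   continuous_with_mapping()
   ```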





[GitHub] [airflow] eladkal commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125630661


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   > I'd say the only real good use case for continuous is with max_active_run =1 and we should limit it to only that.
   
   I agree. I'm pretty sure that users who use the current workaround also explicitly set `max_active_runs=1` in their dag.
   





[GitHub] [airflow] uranusjr commented on pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on PR #29909:
URL: https://github.com/apache/airflow/pull/29909#issuecomment-1455003651

   I’m not sure a new directive is worthwhile tbh.




[GitHub] [airflow] uranusjr commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1136557865


##########
airflow/models/dag.py:
##########
@@ -546,6 +553,9 @@ def __init__(
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
         self.max_active_runs = max_active_runs
+        if self.timetable.limit_active_runs is not None:
+            if self.timetable.limit_active_runs < self.max_active_runs:
+                self.max_active_runs = self.timetable.limit_active_runs

Review Comment:
   Personally I feel it would be reasonable to fail the DAG since both the timetable and `max_active_runs` are DAG-level arguments and they can be resolved quite easily by the user. Emitting a DagWarning is reasonable too if an error is too aggressive.
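   
   For reference, a minimal sketch of the "fail the DAG" option, written as a standalone helper. The `limit_active_runs` attribute name comes from the diff above; the exception type and message wording are illustrative, not the merged code.
   ```python
   from __future__ import annotations
   
   from airflow.exceptions import AirflowException
   from airflow.timetables.base import Timetable
   
   
   def validate_max_active_runs(dag_id: str, timetable: Timetable, max_active_runs: int) -> None:
       """Raise instead of silently overriding max_active_runs on a conflict."""
       limit = getattr(timetable, "limit_active_runs", None)
       if limit is not None and max_active_runs > limit:
           raise AirflowException(
               f"DAG {dag_id!r} sets max_active_runs={max_active_runs}, but its timetable "
               f"{type(timetable).__name__} allows at most {limit} active run(s)."
           )
   ```
   `DAG.__init__` could call something like this right after assigning `self.max_active_runs`, or downgrade the raise to a DagWarning if an error proves too aggressive.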





[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125601470


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   >  Can we provide data interval as (None, current_start_date) for the first run and (previous_start_date, current_start_date) for the next dag runs?
   
   Great suggestion, I'll add this.





[GitHub] [airflow] potiuk commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1138308750


##########
airflow/timetables/base.py:
##########
@@ -135,6 +135,15 @@ class Timetable(Protocol):
     This should be a list of field names on the DAG run object.
     """
 
+    active_runs_limit: int | None = None
+    """Override the max_active_runs parameter of any DAGs using this timetable.
+    
+    This is called during DAG initializing, and will set the max_active_runs if 

Review Comment:
   ```suggestion
       This is called during DAG initializing, and will set the max_active_runs if
   ```





[GitHub] [airflow] uranusjr commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1136558147


##########
airflow/timetables/base.py:
##########
@@ -135,6 +135,15 @@ class Timetable(Protocol):
     This should be a list of field names on the DAG run object.
     """
 
+    limit_active_runs: int | None = None

Review Comment:
   This name sounds like a boolean, perhaps `active_run_limit` would be better?





[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1134244230


##########
airflow/models/dag.py:
##########
@@ -545,7 +552,10 @@ def __init__(
         self.template_undefined = template_undefined
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
-        self.max_active_runs = max_active_runs
+        if isinstance(self.timetable, ContinuousTimetable):

Review Comment:
   This is a good idea, and could probably be used for other future timetables - thanks for the suggestion, I'll take a look at this later today / maybe tomorrow.





[GitHub] [airflow] uranusjr commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1133515864


##########
airflow/models/dag.py:
##########
@@ -545,7 +552,10 @@ def __init__(
         self.template_undefined = template_undefined
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
-        self.max_active_runs = max_active_runs
+        if isinstance(self.timetable, ContinuousTimetable):

Review Comment:
   We could add a new hook in the timetable protocol for this. Most timetables would return None (no limit) and ContinuousTimetable returns 1.
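   
   Sketched out, that hook could be a plain class attribute on the protocol that concrete timetables override. This anticipates the `limit_active_runs` / `active_runs_limit` attribute visible in the diffs later in this thread; treat the exact name and classes below as illustrative stand-ins.
   ```python
   from __future__ import annotations
   
   from typing import Protocol
   
   
   class Timetable(Protocol):
       """Trimmed-down stand-in for airflow.timetables.base.Timetable."""
   
       active_runs_limit: int | None = None  # None: the timetable imposes no limit
   
   
   class ContinuousTimetable:
       """Stand-in for the PR's timetable: only one run should ever be active."""
   
       active_runs_limit: int | None = 1
   
   
   # DAG.__init__ can then clamp or validate against timetable.active_runs_limit.
   ```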





[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1138094579


##########
airflow/models/dag.py:
##########
@@ -546,6 +553,9 @@ def __init__(
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
         self.max_active_runs = max_active_runs
+        if self.timetable.limit_active_runs is not None:
+            if self.timetable.limit_active_runs < self.max_active_runs:
+                self.max_active_runs = self.timetable.limit_active_runs

Review Comment:
   I'm fine with an error - that way we won't have any confusing mismatch between the code and the actual behaviour of a running DAG.





[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125601220


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   > Should we enforce `max_active_runs=1` for @continuous runs? If I understand correctly this is the only reasonable setting for @continuous, because if max_active_runs is not set or is set to a high number, the scheduler will start scheduling and running a lot of dagruns for such a dag.
   
   I thought about this but ultimately thought it would be weird to impose an artificial limitation like this. I think in some cases a user might want to have multiple runs executing at all times (for example, a job with multiple stages which uses `depends_on_past` for continuous pipelined execution). Additionally, a similar hazard already exists with `schedule_interval="* * * * *"`, which could create many jobs quite quickly.
   
   So I would advocate for leaving this as-is and continuing to use `max_active_runs` as the limit; we can include some warnings about this potential hazard in the docs. Thoughts?
   
   > Besides, this change also needs documentation and examples
   
   Yup, I'll get to work on those shortly.





[GitHub] [airflow] potiuk commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125576178


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   Should we enforce `max_active_runs=1` for `@continuous` runs? If I understand correctly this is the only reasonable setting for @continuous, because if max_active_runs is not set or is set to a high number, the scheduler will start scheduling and running a lot of dagruns for such a dag.
   
   Do I understand correctly?
   
   Besides, this change also needs documentation and examples.





[GitHub] [airflow] potiuk commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1138342857


##########
airflow/timetables/base.py:
##########
@@ -135,6 +135,15 @@ class Timetable(Protocol):
     This should be a list of field names on the DAG run object.
     """
 
+    active_runs_limit: int | None = None
+    """Override the max_active_runs parameter of any DAGs using this timetable.
+    

Review Comment:
   ```suggestion
   
   ```





[GitHub] [airflow] uranusjr commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1136504568


##########
airflow/models/dag.py:
##########
@@ -858,7 +868,7 @@ def infer_automated_data_interval(self, logical_date: datetime) -> DataInterval:
         DO NOT use this method is there is a known data interval.
         """
         timetable_type = type(self.timetable)
-        if issubclass(timetable_type, (NullTimetable, OnceTimetable)):
+        if issubclass(timetable_type, (NullTimetable, OnceTimetable, ContinuousTimetable)):

Review Comment:
   This is not needed. The function is designed for older runs created prior to Airflow 2.2 (when timetables were introduced), and it should not be possible for a `ContinuousTimetable` to reach here.





[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1136504088


##########
airflow/models/dag.py:
##########
@@ -546,6 +553,9 @@ def __init__(
         self.last_loaded = timezone.utcnow()
         self.safe_dag_id = dag_id.replace(".", "__dot__")
         self.max_active_runs = max_active_runs
+        if self.timetable.limit_active_runs is not None:
+            if self.timetable.limit_active_runs < self.max_active_runs:
+                self.max_active_runs = self.timetable.limit_active_runs

Review Comment:
   Do you think it's worth logging a warning when we invisibly override the DAG config like this?





[GitHub] [airflow] hussein-awala commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1125588592


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   @potiuk :+1: I was thinking about the same question.
   
   Also this one: can we provide the data interval as (None, current_start_date) for the first run and (previous_start_date, current_start_date) for the next dag runs? This would make the new timetable useful for other use cases, like processing data in batches ASAP.





[GitHub] [airflow] potiuk commented on pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #29909:
URL: https://github.com/apache/airflow/pull/29909#issuecomment-1455026267

   > I’m not sure a new directive is worthwhile tbh.
   
   I think it's an interesting pattern - a number of users asked for it, and any attempt to do it without such a directive is pretty cumbersome. While I initially had the same thought, looking at how simple it will be for users to use, I think it is worthwhile to add. 
   
   I also generally think we should have more built-in timetables that serve various cases like this. The custom timetable interface is prohibitively complex even for an experienced Python developer, and testing it is next to impossible. A good example of that is CronTriggerTimetable, which - even though it was written by us - apparently exposes a race condition: https://github.com/apache/airflow/issues/27399 where it occasionally loses one task. 
   
   So as a community, I think we should invest in having a few more "generic", "robust", and "declaratively configurable" timetables that serve a number of common cases, as otherwise we are asking our users to put too much effort into developing their own custom timetables. 
   
   This is just one example of such a case, and IMHO more should follow.




[GitHub] [airflow] SamWheating commented on a diff in pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "SamWheating (via GitHub)" <gi...@apache.org>.
SamWheating commented on code in PR #29909:
URL: https://github.com/apache/airflow/pull/29909#discussion_r1133149457


##########
airflow/timetables/simple.py:
##########
@@ -108,6 +109,37 @@ def next_dagrun_info(
         return DagRunInfo.exact(run_after)
 
 
+class ContinuousTimetable(_TrivialTimetable):
+    """Timetable that schedules continually, while still respecting start_date and end_date
+
+    This corresponds to ``schedule="@continuous"``.
+    """
+
+    description: str = "As frequently as possible while still obeying max_active_runs"

Review Comment:
   > Also this one: Can we provide data interval as (None, current_start_date) for the first run and (previous_start_date, current_start_date) for the next dag runs?
   
   Looking into this, I think that the data interval must be composed of valid Datetimes (not None) and I think it would be most accurate to use `(dag.start_date, utcnow())` for the first interval and `(previous_end_date, utcnow())` for successive runs.
   
   Does this sound agreeable, @hussein-awala ?
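   
   A rough sketch of what that interval logic could look like, assuming the `Timetable.next_dagrun_info` signature from Airflow's custom-timetable docs. It leans on the private `_TrivialTimetable` base only because the diff in this PR does the same; the class name and details below are illustrative, not the code in this PR.
   ```python
   from __future__ import annotations
   
   from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction
   from airflow.timetables.simple import _TrivialTimetable
   from airflow.utils import timezone
   
   
   class ContinuousTimetableSketch(_TrivialTimetable):
       """Illustrative only: run again immediately, interval = time since the previous run."""
   
       def next_dagrun_info(
           self,
           *,
           last_automated_data_interval: DataInterval | None,
           restriction: TimeRestriction,
       ) -> DagRunInfo | None:
           if restriction.earliest is None:
               return None  # no start_date, nothing to schedule
           now = timezone.coerce_datetime(timezone.utcnow())
           if last_automated_data_interval is None:
               start = restriction.earliest  # first run: from the DAG's start_date
           else:
               start = last_automated_data_interval.end  # later runs: from the previous end
           end = max(now, start)
           if restriction.latest is not None and end > restriction.latest:
               return None  # past end_date, stop scheduling
           return DagRunInfo.interval(start=start, end=end)
   ```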
   





[GitHub] [airflow] potiuk merged pull request #29909: Adding ContinuousTimetable and support for @continuous schedule_interval

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #29909:
URL: https://github.com/apache/airflow/pull/29909

