You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/12/06 08:25:13 UTC

[GitHub] [airflow] eskarimov opened a new pull request #20062: Use original task's `start_date` if a task continues after deferral

eskarimov opened a new pull request #20062:
URL: https://github.com/apache/airflow/pull/20062


   The PR intends to close the issue #19382, using task's original `start_date` when it continues after deferral.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r764756302



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       The scheduler calls `set_state()` to defer a task, so I was suggesting setting `start_date` to None when that happens (the scheduler would commit this to the database), so when the task instance was fetched after deferral, the database record would have `start_date` of None. But if you feel clearing `start_date` in the database is wrong (based on your comment above), the best approach would be what you have right now, plus a comment reminding future maintainers to keep an eye on this part of the code base when they change the semantic of `next_method`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#issuecomment-997868326


   Could someone check the PR if it looks good, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r764756302



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       The scheduler calls `set_state()` to defer a task, so I was suggesting setting `start_date` to None when that happens (the scheduler would commit this change to the database), so when the task instance was fetched after deferral, the database record would have `start_date` of None. But if you feel clearing `start_date` in the database is wrong (based on your comment above), the best approach would be what you have right now, plus a comment reminding future maintainers to keep an eye on this part of the code base when they change the semantic of `next_method`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r762953978



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       What would `start_date` be if the task is not deferred? If it’s `None`, perhaps a better solution would be
   
   ```python
   if self.start_date is None:
       self.start_date = timezone.utcnow()
   ```
   
   but perhaps it would not be None when being rescheduled? I guess what I’m trying to say is we might want to consider more context for this, since relying on `next_method` is a bit unstable since that value is meaning something else and it’s too easy to change its logic elsewhere without realising it causes side effect here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r763762267



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       It probably makes the most sense to change `start_date` in `TaskInstance.set_state()`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #20062:
URL: https://github.com/apache/airflow/pull/20062


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r763154994



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       You're totally right, when a new task instance starts the first time, then `start_date` would be `None`.
   In case a task was rescheduled, then `start_date` will be equal to the previous run attempt.
   
   Does it make sense to consider setting `start_date` to `None` when we clear the task instance?
   https://github.com/apache/airflow/blob/1349cc62df9fa872fd24865ade7b281bb229d70c/airflow/models/taskinstance.py#L202-L204




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r763154994



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       You're totally right, when a new task instance starts the first time, then `start_date` would be `None`.
   In case a task was rescheduled, then `start_date` would be equal to the previous run attempt.
   
   Does it make sense to consider setting `start_date` to `None` when we clear the task instance?
   https://github.com/apache/airflow/blob/1349cc62df9fa872fd24865ade7b281bb229d70c/airflow/models/taskinstance.py#L202-L204




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#issuecomment-1004657257


   I've rebased the PR to run the full matrix of tests, everything looks green, do you think we could merge the PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#issuecomment-1000758162


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r764726355



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       I'm a bit confused here now:
   - There's no other indicator aside `TaskInstance.next_method` to use for identifying if the task was returned after deferral.
   - A task instance is refreshed from DB each time it starts. It happens in function `check_and_change_state_before_execution()`
   https://github.com/apache/airflow/blob/1349cc62df9fa872fd24865ade7b281bb229d70c/airflow/models/taskinstance.py#L1168-L1174
   where it got `start_date` assigned from the previous run.
   If we want to use the snippet from above with `if self.start_date is None:`, then `start_date` needs to be `None` in DB at the moment of refresh, but it doesn't feel right to set it to `None` in DB before the new task instance has actually started executing.
   - `TaskInstance.set_state()` isn't used inside `TaskInstance` class, I've checked for usages, used in different places like `airflow.jobs`, `airflow.executors.debug_executor`, etc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov commented on a change in pull request #20062: Use original task's `start_date` if a task continues after deferral

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #20062:
URL: https://github.com/apache/airflow/pull/20062#discussion_r764870343



##########
File path: airflow/models/taskinstance.py
##########
@@ -1200,7 +1200,8 @@ def check_and_change_state_before_execution(
             # Attempt 0 for the first attempt).
             # Set the task start date. In case it was re-scheduled use the initial
             # start date that is recorded in task_reschedule table
-            self.start_date = timezone.utcnow()
+            # If the task continues after being deferred (next_method is set), use the original start_date
+            self.start_date = self.start_date if self.next_method else timezone.utcnow()

Review comment:
       If `start_date` would be set to `None` when the scheduler goes to defer the task, then we won't be able to know what was the original `start_date`, i.e. when the task before deferral started.
   Because we want to count the overall execution time, including time before deferral + defer time.
   
   Sorry, I haven't explained it correctly what I meant about clearing `start_date` in database: 
   it's getting tricky, when the task is up for retry.
   In case of the second and further attempts `refresh_from_db()` sets `start_date` equal to the previous try attempt, i.e. it'd be `not None`.
   Which means if we use the snippet above
   ```python
   if self.start_date is None:
       self.start_date = timezone.utcnow()
   ```
   then in case of re-try, the task would continue execution with the `start_date` from the previous attempt, which is wrong.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org