You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/11 10:20:57 UTC

[GitHub] [airflow] hterik opened a new pull request #20806: Don't retry kubernetes pod start forever.

hterik opened a new pull request #20806:
URL: https://github.com/apache/airflow/pull/20806


   Use a progressive backoff timer and once a fixed number
   of retries has been reached, fail the task.
   
   Fixes: #19657
   
   WIP disclaimer: Haven't tested this fully yet. Preliminary high-level review would still be appreciated in case of any obvious design flaw.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ferruzzi commented on a change in pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on a change in pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#discussion_r782594456



##########
File path: airflow/executors/kubernetes_executor.py
##########
@@ -50,8 +50,17 @@
 from airflow.utils.session import provide_session
 from airflow.utils.state import State
 
-# TaskInstance key, command, configuration, pod_template_file
-KubernetesJobType = Tuple[TaskInstanceKey, CommandType, Any, Optional[str]]
+
+class KubernetesJobType(NamedTuple):
+    key: TaskInstanceKey
+    command: CommandType
+    config: Any
+    pod_template_file: Optional[str]
+    next_allowed_retry: datetime
+    retry_count: int
+
+
+RETRY_BACKOFF_SECONDS = [1, 2, 5, 10, 10, 30, 60, 60]

Review comment:
       As a counterpoint, perhaps a formula for the backoff (exponential backoff? doubling?) and set a limit to the number of retries instead of defining the exact times like this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
eladkal commented on pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#issuecomment-1067666194


   @hterik can you rebase and resolve conflicts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#discussion_r782672503



##########
File path: airflow/executors/kubernetes_executor.py
##########
@@ -50,8 +50,17 @@
 from airflow.utils.session import provide_session
 from airflow.utils.state import State
 
-# TaskInstance key, command, configuration, pod_template_file
-KubernetesJobType = Tuple[TaskInstanceKey, CommandType, Any, Optional[str]]
+
+class KubernetesJobType(NamedTuple):
+    key: TaskInstanceKey
+    command: CommandType
+    config: Any
+    pod_template_file: Optional[str]
+    next_allowed_retry: datetime
+    retry_count: int
+
+
+RETRY_BACKOFF_SECONDS = [1, 2, 5, 10, 10, 30, 60, 60]

Review comment:
       +1 to a formula (e.g. exponential but no longer than 60 seconds)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#issuecomment-1054516905


   closed/reopened to rebuild


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#issuecomment-1060015601


   Needs rebase now, unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#issuecomment-1052842272


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
potiuk closed pull request #20806:
URL: https://github.com/apache/airflow/pull/20806


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] hterik commented on pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
hterik commented on pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#issuecomment-1054476434


   Looks like some flaky CI issue, can actions be retried? 
   Rest of change should be ready for review now btw. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] hterik commented on a change in pull request #20806: Don't retry kubernetes pod start forever.

Posted by GitBox <gi...@apache.org>.
hterik commented on a change in pull request #20806:
URL: https://github.com/apache/airflow/pull/20806#discussion_r782007176



##########
File path: airflow/executors/kubernetes_executor.py
##########
@@ -599,6 +608,9 @@ def sync(self) -> None:
         for _ in range(self.kube_config.worker_pods_creation_batch_size):
             try:
                 task = self.task_queue.get_nowait()
+                if datetime.utcnow() < task.next_allowed_retry:
+                    continue

Review comment:
       Not sure how this will work together with the self.event_scheduler.run(blocking=False) on line 646 below. Will it keep looping very fast as long as the queue has items or is there some other delay built in here?

##########
File path: airflow/executors/kubernetes_executor.py
##########
@@ -50,8 +50,17 @@
 from airflow.utils.session import provide_session
 from airflow.utils.state import State
 
-# TaskInstance key, command, configuration, pod_template_file
-KubernetesJobType = Tuple[TaskInstanceKey, CommandType, Any, Optional[str]]
+
+class KubernetesJobType(NamedTuple):
+    key: TaskInstanceKey
+    command: CommandType
+    config: Any
+    pod_template_file: Optional[str]
+    next_allowed_retry: datetime
+    retry_count: int
+
+
+RETRY_BACKOFF_SECONDS = [1, 2, 5, 10, 10, 30, 60, 60]

Review comment:
       Are these numbers reasonable? Or should it continue longer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org