You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/13 09:07:20 UTC

[GitHub] [airflow] Adil-Ibragimov opened a new pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Adil-Ibragimov opened a new pull request #10315:
URL: https://github.com/apache/airflow/pull/10315


   Added an optional flags
   
   related: #8722
   
   This is rebased version of https://github.com/apache/airflow/pull/8734
   
   ... This will surface OOMKilled cases by kubernetes (#8722)
   
   Added an optional flag that logs container statuses when the pod fails. It helps when a container runs out of memory to log the OOMKilled status of it. Users then would know they need to request more memory for their tasks.
   
   Make sure to mark the boxes below before creating PR: [x]
   
   - [x]  Description above provides context of the change
   - [x]  Unit tests coverage for changes (not needed for documentation changes)
   - [x]  Target Github ISSUE in description if exists
   - [x]  Commits follow "How to write a good git commit message"
   - [x]  Relevant documentation is updated including usage instructions.
   - [x]  I will engage committers as explained in Contribution Workflow Example.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-673403309


   @ashb @chrismclennon please review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-682328260


   @ashb deleted log_containers_statuses_on_failure params. See comment from chrismclennon above about what logs we want to get from container.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r484638836



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -350,6 +359,19 @@ def _set_name(self, name):
         validate_key(name, max_length=220)
         return re.sub(r'[^a-z0-9.-]+', '-', name.lower())
 
+    def _log_pod_events_on_failure(self, pod, launcher):
+        if self.log_events_on_failure:
+            for event in launcher.read_pod_events(pod).items:
+                self.log.error("Pod Event: %s - %s", event.reason, event.message)
+
+    def _log_pod_status_on_failure(self, pod, launcher):
+        try:
+            pod_status = launcher.read_pod_status(pod)
+            self.log.error('Pod not succeeded, look for OOMKilled in containers statuses.')
+            self.log.error(pod_status.status.container_statuses)
+        except AirflowException as ex:
+            self.log.exception('Failed to get container statuses: %s', ex)

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
dimberman commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r490458722



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -138,6 +138,9 @@ class KubernetesPodOperator(BaseOperator):  # pylint: disable=too-many-instance-
     :type init_containers: list[kubernetes.client.models.V1Container]
     :param log_events_on_failure: Log the pod's events if a failure occurs
     :type log_events_on_failure: bool
+    :param retry_only_on_pod_launching_failure: retry logic is only effective if pod launching fails. This

Review comment:
       I'm confused here, how does this  know if the image has failed vs. the other arguments (e.g. a failed volume mount?)

##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -187,6 +187,23 @@ def read_pod_logs(self, pod: V1Pod, tail_lines: int = 10):
                 'There was an error reading the kubernetes API: {}'.format(e)
             )
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),

Review comment:
       Is there a reason this is hard-coded? Should # of attempts be configurable?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-673359637


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, pylint and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/master/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/master/docs/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/master/BREEZE.rst) for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://apache-airflow-slack.herokuapp.com/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-691803644


   @ashb Ping!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] chrismclennon commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
chrismclennon commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-681338417


   Chiming in since I work with Adil and we make use of this in our Airflow installation. Our users have found this helpful to debug any Kubernetes related errors that would crop up. Things like OOMKill, or any event that would prevent the pod from launching, such as the image not being found, which are otherwise unreported in the logs.
   
   An example of what the logs look like:
   ```
   [2020-08-19 22:22:34,666] {IndeedKubernetesPodOperator.py:324} ERROR - Pod Event: Scheduled - Successfully assigned airflow--airflow/print-numbers-e3e77321 to ip-10-104-4-119.ec2.internal
   [2020-08-19 22:22:34,666] {IndeedKubernetesPodOperator.py:324} ERROR - Pod Event: Pulling - Pulling image "alpine"
   [2020-08-19 22:22:34,666] {IndeedKubernetesPodOperator.py:324} ERROR - Pod Event: Pulled - Successfully pulled image "alpine"
   [2020-08-19 22:22:34,666] {IndeedKubernetesPodOperator.py:324} ERROR - Pod Event: Created - Created container base
   [2020-08-19 22:22:34,667] {IndeedKubernetesPodOperator.py:324} ERROR - Pod Event: Started - Started container base
   [2020-08-19 22:22:34,762] {IndeedKubernetesPodOperator.py:330} ERROR - Pod has not succeeded, look for OOMKilled in containers statuses.
   [2020-08-19 22:22:34,762] {IndeedKubernetesPodOperator.py:332} ERROR - [{'container_id': 'docker://52431fc6b1f67074d92804785a73005d83cde2091cac9c3ada2edcb806cca489',
    'image': 'alpine:latest',
    'image_id': 'docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321',
    'last_state': {'running': None, 'terminated': None, 'waiting': None},
    'name': 'base',
    'ready': True,
    'restart_count': 0,
    'state': {'running': {'started_at': datetime.datetime(2020, 8, 19, 22, 21, 5, tzinfo=tzlocal())},
              'terminated': None,
              'waiting': None}}]
   ```
   
   I agree, no need to have the boolean flag. This should be a default. Let's remove that.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r483569475



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -408,17 +430,18 @@ def create_new_pod_for_operator(self, labels, launcher) -> Tuple[State, k8s.V1Po
 
         self.pod = pod
         self.log.debug("Starting pod:\n%s", yaml.safe_dump(pod.to_dict()))
+        final_state, result = None, None
         try:
             launcher.start_pod(
                 pod,
                 startup_timeout=self.startup_timeout_seconds)
             final_state, result = launcher.monitor_pod(pod=pod, get_logs=self.get_logs)
-        except AirflowException:
-            if self.log_events_on_failure:
-                for event in launcher.read_pod_events(pod).items:
-                    self.log.error("Pod Event: %s - %s", event.reason, event.message)
-            raise
         finally:
+            if final_state != State.SUCCESS:
+                # Before deleting the pod we get events and status of the pod (events can be fetched
+                # after pod deletion but not statuses). For consistency both are fetched before deletion.
+                self._log_pod_events_on_failure(pod, launcher)
+                self._log_pod_status_on_failure(pod, launcher)

Review comment:
       If either of these fail, the pod won't be deleted anymore.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r483567482



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -294,6 +299,10 @@ def execute(self, context) -> Optional[str]:
                 self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
                 final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
             if final_state != State.SUCCESS:
+                if self.retry_only_on_pod_launching_failure:

Review comment:
       Is there any other way this condition could be hit?
   
   I'm just not 100% sure on the name for this setting -- I can see it useful but (in my sleep deprived state) I'm struggling to parse the docs/name of the param to work out what it actually does.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r483568891



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -350,6 +359,19 @@ def _set_name(self, name):
         validate_key(name, max_length=220)
         return re.sub(r'[^a-z0-9.-]+', '-', name.lower())
 
+    def _log_pod_events_on_failure(self, pod, launcher):
+        if self.log_events_on_failure:
+            for event in launcher.read_pod_events(pod).items:
+                self.log.error("Pod Event: %s - %s", event.reason, event.message)
+
+    def _log_pod_status_on_failure(self, pod, launcher):
+        try:
+            pod_status = launcher.read_pod_status(pod)
+            self.log.error('Pod not succeeded, look for OOMKilled in containers statuses.')
+            self.log.error(pod_status.status.container_statuses)
+        except AirflowException as ex:
+            self.log.exception('Failed to get container statuses: %s', ex)

Review comment:
       `log.exception` already includes the exception, so:
   
   ```suggestion
               self.log.exception('Failed to get container status')
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r484642940



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -138,6 +138,9 @@ class KubernetesPodOperator(BaseOperator):  # pylint: disable=too-many-instance-
     :type init_containers: list[kubernetes.client.models.V1Container]
     :param log_events_on_failure: Log the pod's events if a failure occurs
     :type log_events_on_failure: bool
+    :param retry_only_on_pod_launching_failure: retry logic is only effective if pod launching fails. This
+        prevents retries when image failures occurs. Useful for non-idempotent tasks.

Review comment:
       Looks like it should be changed like this.
   ```suggestion
           prevents retries when image failures occurs. Useful for idempotent tasks.
   ```

##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -294,6 +299,10 @@ def execute(self, context) -> Optional[str]:
                 self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
                 final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
             if final_state != State.SUCCESS:
+                if self.retry_only_on_pod_launching_failure:

Review comment:
       Fixed doc.
   Let me know if you have better name for this parameter.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] closed pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #10315:
URL: https://github.com/apache/airflow/pull/10315


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r484639009



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -408,17 +430,18 @@ def create_new_pod_for_operator(self, labels, launcher) -> Tuple[State, k8s.V1Po
 
         self.pod = pod
         self.log.debug("Starting pod:\n%s", yaml.safe_dump(pod.to_dict()))
+        final_state, result = None, None
         try:
             launcher.start_pod(
                 pod,
                 startup_timeout=self.startup_timeout_seconds)
             final_state, result = launcher.monitor_pod(pod=pod, get_logs=self.get_logs)
-        except AirflowException:
-            if self.log_events_on_failure:
-                for event in launcher.read_pod_events(pod).items:
-                    self.log.error("Pod Event: %s - %s", event.reason, event.message)
-            raise
         finally:
+            if final_state != State.SUCCESS:
+                # Before deleting the pod we get events and status of the pod (events can be fetched
+                # after pod deletion but not statuses). For consistency both are fetched before deletion.
+                self._log_pod_events_on_failure(pod, launcher)
+                self._log_pod_status_on_failure(pod, launcher)

Review comment:
       fixed. Added exception handling in _log_pod_events_on_failure




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-686852023


   @ashb ping!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-688680867


   @ashb I think I've fixed all comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] Adil-Ibragimov commented on a change in pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
Adil-Ibragimov commented on a change in pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#discussion_r484642741



##########
File path: airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py
##########
@@ -294,6 +299,10 @@ def execute(self, context) -> Optional[str]:
                 self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
                 final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
             if final_state != State.SUCCESS:
+                if self.retry_only_on_pod_launching_failure:

Review comment:
       Maybe I have mistake in doc.
   doc for this param says:
   ```
       :param retry_only_on_pod_launching_failure: retry logic is only effective if pod launching fails. This
           prevents retries when image failures occurs. Useful for non-idempotent tasks.
       :type retry_only_on_pod_launching_failure: bool
   ```
   As far as I understand Idempotence (https://en.wikipedia.org/wiki/Idempotence) the result of idempotent task will be the same if we run it once again. So if it fails then it make no sence to retry, it will fail again.
   Looks like we should change 
   ```
   Useful for non-idempotent tasks.
   ```
   with 
   ```
   Useful for idempotent tasks.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] stale[bot] commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-727619497


   This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #10315: Add retry_only_on_pod_launching_failure and log_container_statuses_on…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10315:
URL: https://github.com/apache/airflow/pull/10315#issuecomment-799013673


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org