You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 14:22:09 UTC

[GitHub] [airflow] dmateusp commented on issue #12136: KubernetesPodOperator breaks with active log-collection for long running tasks

dmateusp commented on issue #12136:
URL: https://github.com/apache/airflow/issues/12136#issuecomment-754002352


   I was able to reproduce this issue after a few days of letting Bash containers run from KubernetesPodOperator, and implement a fix:
   
   I ran two bash tasks, one with the fix (EarnestKubernetesPodOperator) and one without (KubernetesPodOperator):
   
   ```python
       bash_echo_1 = KubernetesPodOperator(
           task_id="bash_echo_1",
           image="bash:4.4",
           name="bash-echo-1",
           cmds=[
               "bash",
               "-c",
               "while true; do sleep 5; done;",
           ],
           do_xcom_push=True,
           get_logs=True,
           **t.DEFAULT_POD_ARGS,
       )
   
       bash_echo_2 = EarnestKubernetesPodOperator(
           task_id="bash_echo_2",
           name="bash-echo-2",
           image="bash:4.4",
           cmds=["bash", "-c", "while true; do sleep 5; done;",],
           get_logs=True,
           **t.DEFAULT_POD_ARGS,
       )
   ```
   
   I let both of these run for a few days and the one without the fix failed a few times while the one with the fix never failed:
   
   <img width="95" alt="Screen Shot 2021-01-04 at 14 07 50" src="https://user-images.githubusercontent.com/8832050/103543765-f3f65480-4e96-11eb-9b7d-05f71a9f1036.png">
   
   ^ First row is `bash_echo_1` and second row is `bash_echo_2`
   
   In the logs of `bash_echo_1` we can see that the issue was reproduced after 3 hours of running:
   
   ```
   [2020-12-31 03:00:36,636] {logging_mixin.py:112} INFO - [2020-12-31 03:00:36,635] {pod_launcher.py:142} INFO - Event: bash-echo-1-3a123499 had an event of type Running
   [2021-01-01 06:12:42,517] {taskinstance.py:1145} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 685, in _update_chunk_length
       self.chunk_left = int(line, 16)
   ValueError: invalid literal for int() with base 16: b''
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
       yield
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
       self._update_chunk_length()
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 689, in _update_chunk_length
       raise httplib.IncompleteRead(line)
   http.client.IncompleteRead: IncompleteRead(0 bytes read)
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
       result = task_copy.execute(context=context)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 251, in execute
       get_logs=self.get_logs)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 117, in run_pod
       return self._monitor_pod(pod, get_logs)
     File "/home/airflow/.local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 124, in _monitor_pod
       for line in logs:
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 796, in __iter__
       for chunk in self.stream(decode_content=True):
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 560, in stream
       for line in self.read_chunked(amt, decode_content=decode_content):
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
       self.gen.throw(type, value, traceback)
     File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
   ```
   
   While the same issue happened in `bash_echo_2` but it was handled by the change in this PR
   
   ```
   [2020-12-31 03:00:36,592] {logging_mixin.py:112} INFO - [2020-12-31 03:00:36,592] {kubernetes_pod_operator.py:270} INFO - Event: bash-echo-2-72839c01adc84a29b3903527d6e6dfba had an event of type Running
   [2021-01-01 06:20:38,083] {logging_mixin.py:112} INFO - [2021-01-01 06:20:38,082] {kubernetes_pod_operator.py:248} INFO - The pod has not logged since the logs were last fetched
   [2021-01-01 06:20:38,373] {logging_mixin.py:112} INFO - [2021-01-01 06:20:38,372] {kubernetes_pod_operator.py:270} INFO - Event: bash-echo-2-72839c01adc84a29b3903527d6e6dfba had an event of type Running
   ```
   
   The only change was to add error handling around `self.read_pod_logs` in the KubernetesPodOperator:
   
   ```python
               try:
                   logs = self.read_pod_logs(pod)
                   for line in logs:
                       self.log.info(line)
               except ProtocolError as pe:
                   _, protocol_exception = pe.args
                   # When no logs are fetched, an IncompleteRead is thrown trying
                   # to decode the stream
                   if str(protocol_exception) == "IncompleteRead(0 bytes read)":
                       self.log.info(
                           "The pod has not logged since the logs were last fetched"
                       )
                   # If the exception is not about an empty stream we raise it
                   else:
                       raise e
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org