Posted to commits@airflow.apache.org by "Liu Xuesi (Jira)" <ji...@apache.org> on 2019/09/29 13:06:00 UTC

[jira] [Created] (AIRFLOW-5571) Kubernetes operator's bug that get logs will make pod exit unexpectedly

Liu Xuesi created AIRFLOW-5571:
----------------------------------

             Summary: Kubernetes operator's bug that get logs will make pod exit unexpectedly
                 Key: AIRFLOW-5571
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5571
             Project: Apache Airflow
          Issue Type: Bug
          Components: operators
    Affects Versions: 1.10.3, 1.10.2, 1.10.1, 1.10.0
            Reporter: Liu Xuesi
         Attachments: k8s_error_log.png

In my project, I manage 200+ jobs in Kubernetes, and while working with them I found a critical bug in the Kubernetes operator.

In pod_launcher.py, in the *_monitor_pod* function:
{code:python}
        if get_logs:
            logs = self.read_pod_logs(pod)  # This call has a retry.
            for line in logs:  # But the exception is thrown from here!
                self.log.info(line)
{code}
In the above code, *logs* is an HTTP response object; since it implements __iter__(), you can iterate over it with a for loop to read the lines. In other words, this reads an endless log over a long-lived HTTP connection.

There is a try/except only around *self.read_pod_logs*; however, if the network disconnects or jitters, the for loop itself will throw.
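This matters because a try/except around the call that *creates* the response iterator does not cover exceptions raised later, while *iterating* it. A minimal, self-contained sketch of the pattern (broken_stream is a hypothetical stand-in for read_pod_logs, and ConnectionError stands in for urllib3's ProtocolError):

```python
def broken_stream():
    """Stand-in for read_pod_logs(): yields one line, then the connection breaks."""
    yield 'ok'
    raise ConnectionError('connection reset mid-stream')

# The try only covers creating the generator, which never fails here...
caught_at_creation = False
try:
    logs = broken_stream()
except ConnectionError:
    caught_at_creation = True

# ...so the error surfaces later, inside the for loop.
caught_during_iteration = False
received = []
try:
    for line in logs:
        received.append(line)
except ConnectionError:
    caught_during_iteration = True
```

This is exactly the shape of the bug: the guarded call succeeds, and the unguarded loop is where the network error escapes.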

As I have 200+ jobs running every day, I see 4~5 of these errors daily. Each one makes the monitor think the pod is down, so the task is marked failed and retried; this eventually leads to data errors.

 

Below is a typical error log:
{code:java}
 [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:00 INFO yarn.Client: Application report for application_1565926539066_3866 (state: RUNNING)\n'
[2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:01 INFO yarn.Client: Application report for application_1565926539066_3866 (state: RUNNING)\n'
[2019-09-17 20:50:02,533] {taskinstance.py:1047} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 639, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 397, in _error_catcher
    yield
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 704, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 643, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/airflow/models/taskinstance.py", line 922, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/k8s_pod_operator.py", line 45, in execute
    super().execute(context)
  File "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 148, in execute
    get_logs=self.get_logs)
  File "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py", line 97, in run_pod
    return self._monitor_pod(pod, get_logs)
  File "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py", line 104, in _monitor_pod
    for line in logs:
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 747, in __iter__
    for chunk in self.stream(decode_content=True):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 527, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 732, in read_chunked
    self._original_response.close()
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 415, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
[2019-09-17 20:50:02,536] {taskinstance.py:1070} INFO - Marking task as UP_FOR_RETRY
[2019-09-17 20:50:02,717] {logging_mixin.py:95} INFO - [2019-09-17 20:50:02,717] {email.py:126} INFO - Sent an alert email to ['data_alert@sensetime.com', 'zhaoyun@sensetime.com']
[2019-09-17 20:50:03,921] {base_task_runner.py:115} INFO - Job 16307: Subtask ods_senseid_log_ingestion [2019-09-17 20:50:03,921] {cli_action_loggers.py:82} DEBUG - Calling callbacks: []
[2019-09-17 20:50:03,925] {base_task_runner.py:115} INFO - Job 16307: Subtask ods_senseid_
{code}

I have added a try/except around the for loop over the logs, and it works fine. Below is the trace for the caught exceptions:

{code}
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))ods-senseid-img-identity-liven-ods-senseid-img-identity-liven-83570f1fGetting log error, stop getting log.
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--885013e7Getting log error, stop getting log.
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--9627184eGetting log error, stop getting log.
{code}
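The guard I added can be sketched roughly as below. This is a simplified, self-contained sketch, not the actual patch: flaky_log_stream is a hypothetical stand-in for self.read_pod_logs, ConnectionError stands in for urllib3's ProtocolError, and a plain print stands in for the Airflow logger. The point is that a broken log connection stops log streaming without failing the task, since the pod itself is still running:

```python
def flaky_log_stream():
    """Stand-in for read_pod_logs(): yields some lines, then the connection breaks."""
    yield b'line 1\n'
    yield b'line 2\n'
    raise ConnectionError('Connection broken: IncompleteRead(0 bytes read)')

def consume_logs(stream):
    """Print log lines; on a network error, stop streaming logs instead of
    letting the exception propagate and fail the task."""
    lines = []
    try:
        for line in stream:
            lines.append(line)
    except ConnectionError as exc:
        print('%r Getting log error, stop getting log.' % exc)
    return lines

consumed = consume_logs(flaky_log_stream())
```

With this guard in place, the monitor can then fall back to polling the pod status to decide success or failure, rather than treating a dropped log connection as a task failure.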



--
This message was sent by Atlassian Jira
(v8.3.4#803005)