Posted to commits@airflow.apache.org by "Natalia Ogden (Jira)" <ji...@apache.org> on 2020/02/14 18:19:00 UTC

[jira] [Created] (AIRFLOW-6811) Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher

Natalia Ogden created AIRFLOW-6811:
--------------------------------------

             Summary: Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher
                 Key: AIRFLOW-6811
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6811
             Project: Apache Airflow
          Issue Type: Bug
          Components: executor-kubernetes
    Affects Versions: 1.10.9
            Reporter: Natalia Ogden
            Assignee: Daniel Imberman


Hello,

I keep seeing DAGs fail at random when running with the Kubernetes Executor. The scheduler logs the following errors:
[2020-02-14 18:09:19,887] {kubernetes_executor.py:447} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2020-02-14 18:09:19,901] {kubernetes_executor.py:351} INFO - Event: and now my watch begins starting at resource_version: 1963215
[2020-02-14 18:10:09,936] {scheduler_job.py:949} INFO - 1 tasks up for execution:
 <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
[2020-02-14 18:10:09,944] {scheduler_job.py:980} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
[2020-02-14 18:10:09,944] {scheduler_job.py:1008} INFO - DAG hello_world_test2 has 0/16 running and queued tasks
[2020-02-14 18:10:09,953] {scheduler_job.py:1058} INFO - Setting the following tasks to queued state:
 <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
[2020-02-14 18:10:09,958] {kubernetes_executor.py:342} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
 yield
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
 self._update_chunk_length()
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
 line = self._fp.fp.readline()
 File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
 return self._sock.recv_into(b)
 File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
 return self.read(nbytes, buffer)
 File "/usr/local/lib/python3.7/ssl.py", line 929, in read
 return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
 File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
 self.resource_version = self._run(kube_client, self.resource_version,
 File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
 def _run(self, kube_client, resource_version, worker_uuid, kube_config):
 File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
 for line in iter_resp_lines(resp):
 File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
 for seg in resp.read_chunked(decode_content=False):
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
 self._original_response.close()
 File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
 self.gen.throw(type, value, traceback)
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
 raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
Process KubernetesJobWatcher-106:
Traceback (most recent call last):
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
 yield
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
 self._update_chunk_length()
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
 line = self._fp.fp.readline()
 File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
 return self._sock.recv_into(b)
 File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
 return self.read(nbytes, buffer)
 File "/usr/local/lib/python3.7/ssl.py", line 929, in read
 return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
 File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
 self.run()
 File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
 self.resource_version = self._run(kube_client, self.resource_version,
 File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
 def _run(self, kube_client, resource_version, worker_uuid, kube_config):
 File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
 for line in iter_resp_lines(resp):
 File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
 for seg in resp.read_chunked(decode_content=False):
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
 self._original_response.close()
 File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
 self.gen.throw(type, value, traceback)
 File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
 raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
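
For illustration, the tracebacks above show a read timeout on the watch stream propagating out of the watcher's run loop and killing the process. Below is a minimal sketch of one way such a loop could treat a read timeout as recoverable, by reopening the stream from the last resource version it saw. This is not the Airflow implementation or its fix. The names watch_with_resume, open_stream, retry_on, max_retries and backoff_seconds are hypothetical, and open_stream stands in for a Kubernetes watch. The sketch uses only the standard library, so it retries on socket.timeout by default; in a real deployment one would presumably also pass urllib3.exceptions.ReadTimeoutError in retry_on.

```python
import socket
import time


def watch_with_resume(open_stream, resource_version="0",
                      retry_on=(socket.timeout,), max_retries=3,
                      backoff_seconds=0.0):
    """Yield events from open_stream(resource_version), reopening on timeouts.

    open_stream is a hypothetical callable that returns an iterable of
    (resource_version, event) pairs. It stands in for a Kubernetes watch.
    """
    retries = 0
    while True:
        try:
            for version, event in open_stream(resource_version):
                # Remember how far we got so a restart resumes from here.
                resource_version = version
                retries = 0
                yield event
            return  # the stream ended normally
        except retry_on:
            retries += 1
            if retries > max_retries:
                raise  # give up and surface the error to the caller
            time.sleep(backoff_seconds)
```

With this shape, a single timeout reopens the stream instead of ending the watcher, and a persistent failure still surfaces after max_retries consecutive attempts without any event in between.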
 