Posted to commits@airflow.apache.org by "Natalia Ogden (Jira)" <ji...@apache.org> on 2020/02/14 18:19:00 UTC
[jira] [Created] (AIRFLOW-6811) Workflows randomly fail because of
ERROR - Unknown error in KubernetesJobWatcher
Natalia Ogden created AIRFLOW-6811:
--------------------------------------
Summary: Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher
Key: AIRFLOW-6811
URL: https://issues.apache.org/jira/browse/AIRFLOW-6811
Project: Apache Airflow
Issue Type: Bug
Components: executor-kubernetes
Affects Versions: 1.10.9
Reporter: Natalia Ogden
Assignee: Daniel Imberman
Hello,
I keep seeing DAGs fail at random when using the Kubernetes Executor, with the following errors:
[2020-02-14 18:09:19,887] {kubernetes_executor.py:447} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2020-02-14 18:09:19,901] {kubernetes_executor.py:351} INFO - Event: and now my watch begins starting at resource_version: 1963215
[2020-02-14 18:10:09,936] {scheduler_job.py:949} INFO - 1 tasks up for execution:
<TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
[2020-02-14 18:10:09,944] {scheduler_job.py:980} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
[2020-02-14 18:10:09,944] {scheduler_job.py:1008} INFO - DAG hello_world_test2 has 0/16 running and queued tasks
[2020-02-14 18:10:09,953] {scheduler_job.py:1058} INFO - Setting the following tasks to queued state:
<TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
[2020-02-14 18:10:09,958] {kubernetes_executor.py:342} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
yield
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
self.resource_version = self._run(kube_client, self.resource_version,
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
def _run(self, kube_client, resource_version, worker_uuid, kube_config):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
Process KubernetesJobWatcher-106:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
yield
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
self.resource_version = self._run(kube_client, self.resource_version,
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
def _run(self, kube_client, resource_version, worker_uuid, kube_config):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
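
The tracebacks above show the watcher process dying when a read on the watch stream times out. As a rough illustration only (this is not Airflow's code, and not a confirmed fix for this issue), a watch loop could treat such a timeout as transient and reopen the stream from the last resource version it saw. The names `watch_with_resume`, `open_stream`, `make_fake_stream` and the "pod-a" events are hypothetical; the sketch uses only the standard library and a fake stream in place of the Kubernetes client.

```python
import socket


def watch_with_resume(open_stream, resource_version="0", max_restarts=3):
    """Yield events from a watch stream, reopening it after a read timeout.

    `open_stream` is a hypothetical callable: it takes a resource version
    and returns an iterator of (resource_version, event) pairs.
    """
    restarts = 0
    while True:
        try:
            for version, event in open_stream(resource_version):
                resource_version = version  # remember the last event seen
                yield event
            return  # the stream ended without an error
        except socket.timeout:
            restarts += 1
            if restarts > max_restarts:
                raise  # repeated timeouts: stop treating them as transient
            # otherwise fall through and reopen from resource_version


def make_fake_stream():
    """Build a fake stream that times out once, then completes (assumption)."""
    calls = []  # records the resource version passed on each (re)open

    def open_stream(resource_version):
        calls.append(resource_version)
        if len(calls) == 1:
            yield ("1963216", "pod-a Running")
            raise socket.timeout("The read operation timed out")
        yield ("1963217", "pod-a Succeeded")

    return open_stream, calls


if __name__ == "__main__":
    open_stream, calls = make_fake_stream()
    for event in watch_with_resume(open_stream, "1963215"):
        print(event)
    print("stream opened with resource versions:", calls)
```

The cap on restarts is a design choice in this sketch: a single timeout is retried quietly, while repeated timeouts still surface as an error instead of looping forever.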