Posted to commits@airflow.apache.org by "Max (Jira)" <ji...@apache.org> on 2020/03/24 17:28:00 UTC
[jira] [Commented] (AIRFLOW-6811) Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher
[ https://issues.apache.org/jira/browse/AIRFLOW-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066003#comment-17066003 ]
Max commented on AIRFLOW-6811:
------------------------------
Dupe of [AIRFLOW-6040|https://issues.apache.org/jira/browse/AIRFLOW-6040]
> Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher
> --------------------------------------------------------------------------------
>
> Key: AIRFLOW-6811
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6811
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor-kubernetes
> Affects Versions: 1.10.9
> Reporter: Natalia Ogden
> Assignee: Daniel Imberman
> Priority: Major
>
> Hello,
> I keep seeing DAGs fail at random under the Kubernetes Executor with the following errors:
> [2020-02-14 18:09:19,887] {kubernetes_executor.py:447} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
> [2020-02-14 18:09:19,901] {kubernetes_executor.py:351} INFO - Event: and now my watch begins starting at resource_version: 1963215
> [2020-02-14 18:10:09,936] {scheduler_job.py:949} INFO - 1 tasks up for execution:
> <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
> [2020-02-14 18:10:09,944] {scheduler_job.py:980} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
> [2020-02-14 18:10:09,944] {scheduler_job.py:1008} INFO - DAG hello_world_test2 has 0/16 running and queued tasks
> [2020-02-14 18:10:09,953] {scheduler_job.py:1058} INFO - Setting the following tasks to queued state:
> <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
> [2020-02-14 18:10:09,958] {kubernetes_executor.py:342} ERROR - Unknown error in KubernetesJobWatcher. Failing
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
>     yield
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
>     self._update_chunk_length()
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
>     line = self._fp.fp.readline()
>   File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
>     return self._sock.recv_into(b)
>   File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
>     return self.read(nbytes, buffer)
>   File "/usr/local/lib/python3.7/ssl.py", line 929, in read
>     return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
>     self.resource_version = self._run(kube_client, self.resource_version,
>   File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
>     def _run(self, kube_client, resource_version, worker_uuid, kube_config):
>   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
>     for line in iter_resp_lines(resp):
>   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>     for seg in resp.read_chunked(decode_content=False):
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
>     self._original_response.close()
>   File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
>     self.gen.throw(type, value, traceback)
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
>     raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
> Process KubernetesJobWatcher-106:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
>     yield
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
>     self._update_chunk_length()
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
>     line = self._fp.fp.readline()
>   File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
>     return self._sock.recv_into(b)
>   File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
>     return self.read(nbytes, buffer)
>   File "/usr/local/lib/python3.7/ssl.py", line 929, in read
>     return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
>     self.run()
>   File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
>     self.resource_version = self._run(kube_client, self.resource_version,
>   File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
>     def _run(self, kube_client, resource_version, worker_uuid, kube_config):
>   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
>     for line in iter_resp_lines(resp):
>   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>     for seg in resp.read_chunked(decode_content=False):
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
>     self._original_response.close()
>   File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
>     self.gen.throw(type, value, traceback)
>   File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
>     raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
>
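For illustration only: the traceback above shows the watcher process dying when a read on the watch stream times out. The sketch below shows one general way a watch loop could treat such a timeout as recoverable and resume from the last seen resource version. It is not Airflow's code and not the fix tracked in either ticket. The names watch_with_retry, open_stream, handle_event and make_flaky_stream are hypothetical, the event strings and resource versions are made up, and socket.timeout stands in for urllib3's ReadTimeoutError so the example needs only the standard library.

```python
import socket
import time


def watch_with_retry(open_stream, handle_event, resource_version="0",
                     max_retries=3, backoff_seconds=0.0):
    """Consume watch events, reopening the stream after a read timeout.

    open_stream(resource_version) is a hypothetical callable returning an
    iterable of (resource_version, event) pairs. It stands in for the
    Kubernetes watch stream; it is not a real Airflow or kubernetes API.
    """
    retries = 0
    while True:
        try:
            for resource_version, event in open_stream(resource_version):
                handle_event(event)
                retries = 0  # progress was made, so reset the retry budget
            return resource_version  # the stream ended without an error
        except socket.timeout:
            # Assumption: a read timeout on the watch is recoverable, so the
            # loop reopens the stream from the last seen resource version.
            retries += 1
            if retries > max_retries:
                raise
            time.sleep(backoff_seconds)


def make_flaky_stream():
    """Build a fake stream that times out once, then finishes normally."""
    calls = {"n": 0}

    def open_stream(resource_version):
        calls["n"] += 1
        if calls["n"] == 1:
            yield ("2", "pod-a Running")
            raise socket.timeout("The read operation timed out")
        yield ("3", "pod-a Succeeded")

    return open_stream


events = []
last_version = watch_with_retry(make_flaky_stream(), events.append, "1")
print(events, last_version)
```

The retry budget resets whenever an event is handled, so only consecutive timeouts count against max_retries; once the budget is exhausted the timeout is re-raised and the caller still sees the failure.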
--
This message was sent by Atlassian Jira
(v8.3.4#803005)