Posted to commits@airflow.apache.org by "Max (Jira)" <ji...@apache.org> on 2020/03/24 17:28:00 UTC

[jira] [Commented] (AIRFLOW-6811) Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher

    [ https://issues.apache.org/jira/browse/AIRFLOW-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066003#comment-17066003 ] 

Max commented on AIRFLOW-6811:
------------------------------

Dupe of [AIRFLOW-6040|https://issues.apache.org/jira/browse/AIRFLOW-6040]

> Workflows randomly fail because of ERROR - Unknown error in KubernetesJobWatcher
> --------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-6811
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6811
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.9
>            Reporter: Natalia Ogden
>            Assignee: Daniel Imberman
>            Priority: Major
>
> Hello,
> I keep experiencing issues with DAGs randomly failing when using the Kubernetes Executor, due to the following errors:
> [2020-02-14 18:09:19,887] {kubernetes_executor.py:447} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
> [2020-02-14 18:09:19,901] {kubernetes_executor.py:351} INFO - Event: and now my watch begins starting at resource_version: 1963215
> [2020-02-14 18:10:09,936] {scheduler_job.py:949} INFO - 1 tasks up for execution:
>  <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
> [2020-02-14 18:10:09,944] {scheduler_job.py:980} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
> [2020-02-14 18:10:09,944] {scheduler_job.py:1008} INFO - DAG hello_world_test2 has 0/16 running and queued tasks
> [2020-02-14 18:10:09,953] {scheduler_job.py:1058} INFO - Setting the following tasks to queued state:
>  <TaskInstance: hello_world_test2.dummy_task 2020-02-14 18:10:09.585646+00:00 [scheduled]>
> [2020-02-14 18:10:09,958] {kubernetes_executor.py:342} ERROR - Unknown error in KubernetesJobWatcher. Failing
> Traceback (most recent call last):
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
>  yield
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
>  self._update_chunk_length()
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
>  line = self._fp.fp.readline()
>  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
>  return self._sock.recv_into(b)
>  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
>  return self.read(nbytes, buffer)
>  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
>  return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
>  
> During handling of the above exception, another exception occurred:
>  
> Traceback (most recent call last):
>  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
>  self.resource_version = self._run(kube_client, self.resource_version,
>  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
>  def _run(self, kube_client, resource_version, worker_uuid, kube_config):
>  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
>  for line in iter_resp_lines(resp):
>  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>  for seg in resp.read_chunked(decode_content=False):
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
>  self._original_response.close()
>  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
>  self.gen.throw(type, value, traceback)
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
>  raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
> Process KubernetesJobWatcher-106:
> Traceback (most recent call last):
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
>  yield
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
>  self._update_chunk_length()
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
>  line = self._fp.fp.readline()
>  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
>  return self._sock.recv_into(b)
>  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
>  return self.read(nbytes, buffer)
>  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
>  return self._sslobj.read(len, buffer)
> socket.timeout: The read operation timed out
>  
> During handling of the above exception, another exception occurred:
>  
> Traceback (most recent call last):
>  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
>  self.run()
>  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
>  self.resource_version = self._run(kube_client, self.resource_version,
>  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
>  def _run(self, kube_client, resource_version, worker_uuid, kube_config):
>  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
>  for line in iter_resp_lines(resp):
>  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>  for seg in resp.read_chunked(decode_content=False):
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
>  self._original_response.close()
>  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
>  self.gen.throw(type, value, traceback)
>  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
>  raise ReadTimeoutError(self._pool, None, "Read timed out.")
> urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='172.20.0.1', port=443): Read timed out.
>  
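For readers following the traceback above: the watcher process dies when the long-lived watch stream hits a read timeout, and the log shows it being restarted from a saved resource_version. The sketch below illustrates that general pattern (reopen the stream from the last seen version when a read times out). It is not Airflow's code and not the fix tracked in AIRFLOW-6040. The names watch_with_restart, open_stream and flaky_stream_factory are hypothetical, and open_stream only stands in for a Kubernetes watch. The sketch catches the standard-library socket.timeout so that it stays self-contained; against the real client one would presumably pass urllib3.exceptions.ReadTimeoutError, the exception shown in the log, via retry_on.

```python
import socket
import time


def watch_with_restart(open_stream, resource_version="0",
                       retry_on=(socket.timeout,), max_restarts=3,
                       backoff_s=0.0):
    """Yield (version, event) pairs, reopening the stream after a timeout.

    open_stream is a hypothetical callable: given a resource version it
    returns an iterator of (version, event) pairs newer than that version.
    """
    restarts = 0
    while True:
        try:
            for version, event in open_stream(resource_version):
                resource_version = version  # remember progress for resuming
                restarts = 0                # count consecutive failures only
                yield version, event
            return  # the stream ended normally
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up and surface the timeout to the caller
            time.sleep(backoff_s)


def flaky_stream_factory():
    """Build a fake stream whose first connection times out part-way."""
    calls = {"n": 0}

    def open_stream(resource_version):
        calls["n"] += 1
        start = int(resource_version) + 1
        for v in range(start, 6):
            if calls["n"] == 1 and v == 3:
                raise socket.timeout("The read operation timed out")
            yield str(v), f"event-{v}"

    return open_stream


# The first connection delivers versions 1 and 2, then times out;
# the second connection resumes after version 2.
events = list(watch_with_restart(flaky_stream_factory()))
```

Resetting the restart counter after every delivered event means max_restarts bounds consecutive failures, so an occasional timeout on an otherwise healthy connection does not eventually exhaust the budget.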



--
This message was sent by Atlassian Jira
(v8.3.4#803005)