You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/13 07:49:52 UTC
[GitHub] [airflow] nguyenmphu opened a new issue, #28328: Scheduler pod hang when K8s API call fail
nguyenmphu opened a new issue, #28328:
URL: https://github.com/apache/airflow/issues/28328
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
Airflow version: `2.3.4`
I have deployed airflow with the official Helm in K8s with `KubernetesExecutor`. Sometimes the scheduler hang when calling K8s API. The log:
``` bash
ERROR - Exception when executing Executor.end
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
self._run_scheduler_loop()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
self.executor.heartbeat()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
self.sync()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
next_event = self.event_scheduler.run(blocking=False)
File "/usr/local/lib/python3.8/sched.py", line 151, in run
action(*argument, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
action(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
for pod in pending_pods().items:
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
return self.list_namespaced_pod_with_http_info(namespace, **kwargs) # noqa: E501
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
return self.rest_client.GET(url,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
return self.request("GET", url,
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
r = self.pool_manager.request(method, url,
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
return self.request_encode_url(
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
return self.urlopen(
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
self.sock = conn = self._new_conn()
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
sys.exit(os.EX_OK)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
self.executor.end()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
self._flush_task_queue()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
File "<string>", line 2, in qsize
File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
```
Then the executor process was killed and the pod was still running. But the scheduler does not work.
After restarting, the scheduler worked usually.
### What you think should happen instead
When the error occurs, the executor needs to auto restart or the scheduler should be killed.
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] potiuk commented on issue #28328: Scheduler pod hang when K8s API call fail
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1367867640
I think no restart is needed. This error seems to be raised (from stacktrace) when everything is finished and simply the connection is reset by a thread that reads it (and it should be simply ignored)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] potiuk closed issue #28328: Scheduler pod hang when K8s API call fail
Posted by GitBox <gi...@apache.org>.
potiuk closed issue #28328: Scheduler pod hang when K8s API call fail
URL: https://github.com/apache/airflow/issues/28328
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] boring-cyborg[bot] commented on issue #28328: Scheduler pod hang when K8s API call fail
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1347880144
Thanks for opening your first issue here! Be sure to follow the issue template!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] maxnathaniel commented on issue #28328: Scheduler pod hang when K8s API call fail
Posted by GitBox <gi...@apache.org>.
maxnathaniel commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1367824018
@potiuk would like to take on this `good first issue`. How should the error be handled in this scenario? Restart the executor? Presumably, closed connection refers to the call via `self.kube_client.list_namespaced_pod`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] potiuk commented on issue #28328: Scheduler pod hang when K8s API call fail
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1366189875
Looks like handling error while reading output on closed connection could be done better.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org