Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/10 18:12:54 UTC
[GitHub] [airflow] dadonnelly316 opened a new issue, #28836: Airflow Scheduler Hangs After Failed K8 API Call
dadonnelly316 opened a new issue, #28836:
URL: https://github.com/apache/airflow/issues/28836
### Apache Airflow version
2.5.0
### What happened
The Airflow scheduler makes a call to the K8 API to create a pod for a task run, but the API returns a 400+ HTTP response code. This causes all subsequent Airflow tasks to be stuck in the "queued" or "scheduled" state. The scheduler must be restarted for tasks to enter the running state.
Similar to #28328, but not seeing the ConnectionResetError exception when calling Executor.end
```
airflow-scheduler Exception when attempting to create Namespaced Pod
airflow-scheduler Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 269, in run_pod_async
resp = self.kube_client.create_namespaced_pod(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7356, in create_namespaced_pod
return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs) # noqa: E501
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7455, in create_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
airflow-scheduler Reason: Internal Server Error
airflow-scheduler urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Exception when executing SchedulerJob._run_scheduler_loop
airflow-scheduler Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
airflow-scheduler http.client.RemoteDisconnected: Remote end closed connection without response
airflow-scheduler During handling of the above exception, another exception occurred:
airflow-scheduler Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 759, in _execute
self._run_scheduler_loop()
File "/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 887, in _run_scheduler_loop
self.executor.heartbeat()
File "/usr/local/lib/python3.9/site-packages/airflow/executors/base_executor.py", line 175, in heartbeat
self.sync()
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 632, in sync
self.kube_scheduler.run_next(task)
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 344, in run_next
self.run_pod_async(pod, **self.kube_config.kube_client_request_args)
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 275, in run_pod_async
raise e
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 269, in run_pod_async
resp = self.kube_client.create_namespaced_pod(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7356, in create_namespaced_pod
return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs) # noqa: E501
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7455, in create_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 168, in request
r = self.pool_manager.request(
File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
airflow-scheduler urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
airflow-scheduler error Unknown error in KubernetesJobWatcher. Failing
airflow-scheduler Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 104, in run
self.resource_version = self._run(
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 166, in _run
self.process_status(
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 218, in process_status
self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, resource_version))
File "<string>", line 2, in put
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
airflow-scheduler Process KubernetesJobWatcher-5:
airflow-scheduler Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 104, in run
self.resource_version = self._run(
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 166, in _run
self.process_status(
File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 218, in process_status
self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, resource_version))
File "<string>", line 2, in put
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
```
### What you think should happen instead
Handle ApiException. We've seen this error for multiple 4XX and 5XX response codes.
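As a rough illustration of the kind of handling meant here, the sketch below wraps a pod-creation call so that an `ApiException` is retried on 5XX and reported back on 4XX instead of propagating into the scheduler loop. It is not the executor's actual code: `create_pod_with_handling`, `max_attempts`, and `backoff_seconds` are hypothetical names, and the local `ApiException` class is only a stand-in for `kubernetes.client.exceptions.ApiException` so the example runs without a cluster. It also does not cover the `urllib3.exceptions.ProtocolError` seen in the second traceback.

```python
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException (illustration only)."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def create_pod_with_handling(create_pod, max_attempts=3, backoff_seconds=0.0):
    """Call create_pod(), retrying on 5XX and reporting failure instead of raising.

    Returns (pod, None) on success or (None, last_exception) when giving up.
    All names and defaults here are hypothetical.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return create_pod(), None
        except ApiException as exc:
            last_error = exc
            # A 4XX response is unlikely to succeed on retry, so stop early.
            if 400 <= exc.status < 500:
                break
            # Linear backoff before the next attempt on 5XX responses.
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    return None, last_error
```

With this shape, the caller could mark the affected task as failed (or requeue it) when the second element is not `None`, and the loop would keep running for the remaining tasks.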
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Other
### Deployment details
K8 deployment
### Anything else
It's difficult to tell how often this issue occurs since it can go unnoticed in a CI environment where the scheduler is often restarted.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] xiankgx commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
Posted by "xiankgx (via GitHub)" <gi...@apache.org>.
xiankgx commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414513094
Hi there, I'm having the same issue with Airflow version 2.2.2 deployed in Kubernetes in AWS.
I'm getting the following lines:
- ERROR - Exception when attempting to create Namespaced Pod: {
- urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
and the Airflow web interface shows "The scheduler does not appear to be running. Last heartbeat was received XXX... The DAGs list may not update, and new tasks will not be scheduled."
When this happens, I will just get the scheduler pod and delete it. What is the proper way to fix/handle this?
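Until the root cause is fixed, the manual "delete the scheduler pod" step could in principle be driven by an automated staleness check. The sketch below is only an assumption about how such a check might look: it takes a dictionary assumed to mirror the JSON returned by the Airflow webserver's `/health` endpoint (`scheduler.status` and `scheduler.latest_scheduler_heartbeat`) and decides whether the last heartbeat is too old. `scheduler_is_stale` and the 30-second default are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def scheduler_is_stale(health, now=None, threshold_seconds=30):
    """Return True when the scheduler heartbeat is missing, unhealthy, or too old.

    `health` is assumed to look like the webserver's /health payload:
    {"scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "<ISO 8601>"}}
    """
    scheduler = health.get("scheduler", {})
    heartbeat = scheduler.get("latest_scheduler_heartbeat")
    if scheduler.get("status") != "healthy" or not heartbeat:
        return True

    last_seen = datetime.fromisoformat(heartbeat)
    # Treat a timestamp without an offset as UTC (an assumption of this sketch).
    if last_seen.tzinfo is None:
        last_seen = last_seen.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return now - last_seen > timedelta(seconds=threshold_seconds)
```

A check like this could back a liveness probe, so that Kubernetes restarts the scheduler pod instead of someone deleting it by hand. It works around the hang rather than fixing it.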
[GitHub] [airflow] boring-cyborg[bot] commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1377660122
Thanks for opening your first issue here! Be sure to follow the issue template!
[GitHub] [airflow] dadonnelly316 commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
Posted by "dadonnelly316 (via GitHub)" <gi...@apache.org>.
dadonnelly316 commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414555586
@xi
[GitHub] [airflow] dadonnelly316 commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
Posted by "dadonnelly316 (via GitHub)" <gi...@apache.org>.
dadonnelly316 commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414562576
@xiankgx we ended up having to change the way our K8 cluster was configured. You might need to review any plugins or security configurations that could cause pod creation to be rejected.
[GitHub] [airflow] potiuk closed issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk closed issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
URL: https://github.com/apache/airflow/issues/28836