Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2023/01/10 18:12:54 UTC

[GitHub] [airflow] dadonnelly316 opened a new issue, #28836: Airflow Scheduler Hangs After Failed K8 API Call

dadonnelly316 opened a new issue, #28836:
URL: https://github.com/apache/airflow/issues/28836

   ### Apache Airflow version
   
   2.5.0
   
   ### What happened
   
   The Airflow scheduler makes a call to the K8 API to create a pod for a task run, but the API returns a 400+ HTTP response code. This causes all subsequent Airflow tasks to be stuck in the "queued" or "scheduled" state. The scheduler must be restarted before tasks can enter the running state.
   
   
   Similar to #28328, but not seeing the ConnectionResetError exception when calling Executor.end 
   
   
   ```
   airflow-scheduler Exception when attempting to create Namespaced Pod
   airflow-scheduler  Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 269, in run_pod_async
       resp = self.kube_client.create_namespaced_pod(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7356, in create_namespaced_pod
       return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7455, in create_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
       return self.__call_api(resource_path, method,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
       response_data = self.request(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
       return self.rest_client.POST(url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
       return self.request("POST", url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
       raise ApiException(http_resp=r)
   kubernetes.client.exceptions.ApiException: (500)
   airflow-scheduler Reason: Internal Server Error
   airflow-scheduler  urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
   Exception when executing SchedulerJob._run_scheduler_loop
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
       httplib_response = self._make_request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 449, in _make_request
       six.raise_from(e, None)
     File "<string>", line 3, in raise_from
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 444, in _make_request
       httplib_response = conn.getresponse()
     File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
       response.begin()
     File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
       version, status, reason = self._read_status()
     File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
       raise RemoteDisconnected("Remote end closed connection without"
   airflow-scheduler  http.client.RemoteDisconnected: Remote end closed connection without response
   airflow-scheduler  During handling of the above exception, another exception occurred:
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 759, in _execute
       self._run_scheduler_loop()
     File "/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 887, in _run_scheduler_loop
       self.executor.heartbeat()
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/base_executor.py", line 175, in heartbeat
       self.sync()
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 632, in sync
       self.kube_scheduler.run_next(task)
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 344, in run_next
       self.run_pod_async(pod, **self.kube_config.kube_client_request_args)
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 275, in run_pod_async
       raise e
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 269, in run_pod_async
       resp = self.kube_client.create_namespaced_pod(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7356, in create_namespaced_pod
       return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 7455, in create_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
       return self.__call_api(resource_path, method,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
       response_data = self.request(
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
       return self.rest_client.POST(url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
       return self.request("POST", url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 168, in request
       r = self.pool_manager.request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
       return self.request_encode_body(
     File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
       return self.urlopen(method, url, **extra_kw)
     File "/usr/local/lib/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
       response = conn.urlopen(method, u.request_uri, **kw)
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
       retries = retries.increment(
     File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 550, in increment
       raise six.reraise(type(error), error, _stacktrace)
     File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 769, in reraise
       raise value.with_traceback(tb)
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
       httplib_response = self._make_request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 449, in _make_request
       six.raise_from(e, None)
     File "<string>", line 3, in raise_from
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 444, in _make_request
       httplib_response = conn.getresponse()
     File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
       response.begin()
     File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
       version, status, reason = self._read_status()
     File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
       raise RemoteDisconnected("Remote end closed connection without"
   airflow-scheduler urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
   airflow-scheduler error  Unknown error in KubernetesJobWatcher. Failing
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 104, in run
       self.resource_version = self._run(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 166, in _run
       self.process_status(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 218, in process_status
       self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, resource_version))
     File "<string>", line 2, in put
     File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
       conn.send((self._id, methodname, args, kwds))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
       self._send_bytes(_ForkingPickler.dumps(obj))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
       self._send(header + buf)
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
       n = write(self._handle, buf)
   airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
   airflow-scheduler Process KubernetesJobWatcher-5:
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 104, in run
       self.resource_version = self._run(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 166, in _run
       self.process_status(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 218, in process_status
       self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, resource_version))
     File "<string>", line 2, in put
     File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
       conn.send((self._id, methodname, args, kwds))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
       self._send_bytes(_ForkingPickler.dumps(obj))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
       self._send(header + buf)
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
       n = write(self._handle, buf)
   airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
   ```
   
   ### What you think should happen instead
   
   Handle ApiException. We've seen this error for multiple 4XX and 5XX response codes.
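
A minimal sketch of what "handle ApiException" could look like, not the fix Airflow actually shipped. The `ApiException` class below is a stand-in for `kubernetes.client.exceptions.ApiException` (which exposes a `status` attribute), so the example runs without the Kubernetes client installed. The names `is_retryable` and `create_pod_with_handling`, and the rule that 429 and 5XX are transient while other 4XX codes are permanent, are assumptions made for illustration.

```python
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException (assumption)."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def is_retryable(status):
    # Assumption: 429 and 5XX are transient; other 4XX codes are rejections
    # (for example, a security policy refusing the pod) and will not succeed
    # on retry.
    return status == 429 or 500 <= status < 600


def create_pod_with_handling(create_fn, max_attempts=3, backoff_seconds=0.0,
                             transient_errors=(ConnectionError,)):
    """Call create_fn; return ("created", response) or ("failed", exception).

    The function never raises for the handled error types, so the caller's
    loop can mark the task as failed and keep processing other tasks.
    """
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return "created", create_fn()
        except ApiException as exc:
            last_exc = exc
            if not is_retryable(exc.status):
                break  # permanent rejection: stop retrying
        except transient_errors as exc:
            last_exc = exc  # dropped connection: worth another attempt
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
    return "failed", last_exc
```

With the real client, `create_fn` would wrap the `create_namespaced_pod` call from the traceback above, and `transient_errors` would likely need to include `urllib3.exceptions.ProtocolError`, since that is the exception that escaped the scheduler loop in this report.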
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   K8 deployment
   
   ### Anything else
   
   It's difficult to tell how often this issue occurs since it can go unnoticed in a CI environment where the scheduler is often restarted. 
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] xiankgx commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call

Posted by "xiankgx (via GitHub)" <gi...@apache.org>.
xiankgx commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414513094

   Hi there, I'm having the same issue with Airflow version 2.2.2 deployed in Kubernetes in AWS.
   
   I'm getting the following lines:
   - ERROR - Exception when attempting to create Namespaced Pod: {
   - urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
   
   and the Airflow web interface shows "The scheduler does not appear to be running. Last heartbeat was received XXX... The DAGs list may not update, and new tasks will not be scheduled." 
   
   When this happens, I will just get the scheduler pod and delete it. What is the proper way to fix/handle this?
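
One way to automate the manual restart described above is to check the scheduler heartbeat and restart only when it goes stale. The sketch below assumes the JSON shape returned by the Airflow webserver's `/health` endpoint (a `scheduler` object with `status` and `latest_scheduler_heartbeat`) and assumes the timestamp carries a UTC offset. The function name `scheduler_is_stale` and the 60-second threshold are hypothetical. It only decides whether a restart is warranted; fetching the JSON and restarting the pod are left out.

```python
import json
from datetime import datetime, timedelta, timezone


def scheduler_is_stale(health_json, now, max_age=timedelta(seconds=60)):
    """Return True if the scheduler looks hung according to /health output."""
    scheduler = json.loads(health_json).get("scheduler", {})
    heartbeat = scheduler.get("latest_scheduler_heartbeat")
    if scheduler.get("status") != "healthy" or not heartbeat:
        return True
    # Assumption: the timestamp is ISO 8601 with an explicit offset,
    # for example "2023-01-10T18:12:30+00:00".
    last_seen = datetime.fromisoformat(heartbeat)
    return now - last_seen > max_age


if __name__ == "__main__":
    sample = json.dumps({
        "scheduler": {
            "status": "healthy",
            "latest_scheduler_heartbeat": "2023-01-10T18:00:00+00:00",
        }
    })
    print(scheduler_is_stale(sample, datetime.now(timezone.utc)))
```

A check like this could back a liveness probe so that Kubernetes restarts the scheduler pod instead of an operator deleting it by hand. It treats the symptom only; the underlying pod-creation failure still needs to be resolved.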




[GitHub] [airflow] boring-cyborg[bot] commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1377660122

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] dadonnelly316 commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call

Posted by "dadonnelly316 (via GitHub)" <gi...@apache.org>.
dadonnelly316 commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414555586

   @xi




[GitHub] [airflow] dadonnelly316 commented on issue #28836: Airflow Scheduler Hangs After Failed K8 API Call

Posted by "dadonnelly316 (via GitHub)" <gi...@apache.org>.
dadonnelly316 commented on issue #28836:
URL: https://github.com/apache/airflow/issues/28836#issuecomment-1414562576

   @xiankgx we ended up having to change the way our K8 cluster was configured. You might need to review any plugins or security configurations that could cause pod creation to be rejected.




[GitHub] [airflow] potiuk closed issue #28836: Airflow Scheduler Hangs After Failed K8 API Call

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk closed issue #28836: Airflow Scheduler Hangs After Failed K8 API Call
URL: https://github.com/apache/airflow/issues/28836

