Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/06 07:49:09 UTC

[GitHub] [airflow] scauglog opened a new issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

scauglog opened a new issue #10193:
URL: https://github.com/apache/airflow/issues/10193


   
   **Apache Airflow version**: 1.10.9 -> 1.10.11
   
   
   **Kubernetes version**: 1.17
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS EKS
   - **Airflow Executor**: KubernetesExecutor
   
   
   **What happened**:
   
   The scheduler gets killed when the Kubernetes worker is unable to parse the DAG.
   
   The DAG does not parse properly because an environment variable is not set in the worker (`AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__MY_VAR`). The variable is correctly set in the scheduler and the webserver, so the DAG shows no error in the UI. In the DAG the variable is retrieved with `os.environ["MY_VAR"]`; since the variable doesn't exist in the worker, this throws a `KeyError` and the worker exits.
   As I understand it, the scheduler tries to watch the worker pod that was created, but since the pod has failed it no longer exists and the watch request times out; that is what we see in the log. This error is not caught, so the scheduler gets killed.
   
   **What you expected to happen**:
   
   The scheduler doesn't get killed, and the worker error is printed in the log. At the very least it would be nice to have the log from the worker; otherwise it's really hard to understand why the scheduler got killed.
   
   **How to reproduce it**:
   
   Create a DAG that uses an environment variable: `os.environ["MY_VAR"]`.
   Set the variable in the scheduler and the webserver but not in the worker, i.e. don't set `AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__MY_VAR`.
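   To make the parse-time failure concrete, here is a hypothetical, self-contained sketch. The `parse_dag` helper and the environment dicts are stand-ins for illustration only, not Airflow code; they mimic what happens when each pod imports a DAG file whose top level does `os.environ["MY_VAR"]`.

```python
def parse_dag(environ):
    """Stand-in for importing the DAG file inside a given pod."""
    my_var = environ["MY_VAR"]  # raises KeyError when the variable is unset
    return my_var

# Scheduler/webserver pods: variable present, DAG parses, UI shows no error.
scheduler_env = {"MY_VAR": "some-value"}
assert parse_dag(scheduler_env) == "some-value"

# Worker pod: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__MY_VAR was never
# set, so the variable is absent and parsing fails before any task runs.
worker_env = {}
try:
    parse_dag(worker_env)
except KeyError as exc:
    print(f"worker parse failed: {exc!r}")  # the worker pod exits here
```

   Because the failure happens at import time, the worker pod dies before it can report a task state, which is what leaves the scheduler watching a pod that no longer exists.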
   
   
   **Anything else we need to know**:
   
   
   This problem occurs every time.
   
   Any relevant logs to include? :
   <details>
   <summary>scheduler log</summary>
   
   [2020-08-06 07:20:33,916] {{kubernetes_executor.py:342}} ERROR - Unknown error in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1646, in _raise_ssl_error
       raise WantReadError()
   OpenSSL.SSL.WantReadError
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
       yield
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
       self._update_chunk_length()
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
       return self._sock.recv_into(b)
     File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
       raise timeout("The read operation timed out")
   socket.timeout: The read operation timed out
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
       self.worker_uuid, self.kube_config)
     File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
       **kwargs):
     File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
       for line in iter_resp_lines(resp):
     File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
       self.gen.throw(type, value, traceback)
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
       raise ReadTimeoutError(self._pool, None, "Read timed out.")
   urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.100.0.1', port=443): Read timed out.
   Process KubernetesJobWatcher-20:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1646, in _raise_ssl_error
       raise WantReadError()
   OpenSSL.SSL.WantReadError
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
       yield
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
       self._update_chunk_length()
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
       return self._sock.recv_into(b)
     File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 326, in recv_into
       raise timeout("The read operation timed out")
   socket.timeout: The read operation timed out
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
       self.worker_uuid, self.kube_config)
     File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
       **kwargs):
     File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
       for line in iter_resp_lines(resp):
     File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
       self.gen.throw(type, value, traceback)
     File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
       raise ReadTimeoutError(self._pool, None, "Read timed out.")
   urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.100.0.1', port=443): Read timed out.
   [2020-08-06 07:20:36,419] {{kubernetes_executor.py:447}} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
   [2020-08-06 07:20:36,722] {{kubernetes_executor.py:351}} INFO - Event: and now my watch begins starting at resource_version: 2673997
   
   </details>
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

github-actions[bot] commented on issue #10193:
URL: https://github.com/apache/airflow/issues/10193#issuecomment-859165204


   This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.





[GitHub] [airflow] github-actions[bot] commented on issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

github-actions[bot] commented on issue #10193:
URL: https://github.com/apache/airflow/issues/10193#issuecomment-863636559


   This issue has been closed because it has not received response from the issue author.





[GitHub] [airflow] jedcunningham commented on issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

jedcunningham commented on issue #10193:
URL: https://github.com/apache/airflow/issues/10193#issuecomment-822938942


   @scauglog, are you still able to reproduce this?





[GitHub] [airflow] boring-cyborg[bot] commented on issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

boring-cyborg[bot] commented on issue #10193:
URL: https://github.com/apache/airflow/issues/10193#issuecomment-669769740


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] github-actions[bot] closed issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

github-actions[bot] closed issue #10193:
URL: https://github.com/apache/airflow/issues/10193


   





[GitHub] [airflow] jedcunningham commented on issue #10193: Scheduler get killed when the DAG doesn't parse correctly in Kubernetes Worker

jedcunningham commented on issue #10193:
URL: https://github.com/apache/airflow/issues/10193#issuecomment-813627691


   I wasn't able to reproduce this with master or 1.10.15. @scauglog, are you still able to reproduce this?

