Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/08 09:02:16 UTC

[GitHub] [airflow] Amey2400 opened a new issue #22074: DAG is marked as failed due to postgresql pod restart in k8s

Amey2400 opened a new issue #22074:
URL: https://github.com/apache/airflow/issues/22074


   ### Apache Airflow version
   
   2.2.3
   
   ### What happened
   
   We are using Airflow with the KubernetesExecutor along with Postgres. Airflow and Postgres are deployed in the same namespace of the Kubernetes cluster.
   During execution, the Postgres pod gets restarted when Kubernetes scales up and down. Because of this, Airflow cannot connect to Postgres and marks the running job as failed.
   Below is the error message we received during execution (a minimal retry sketch follows the traceback):
   ```
   [2022-03-07, 12:19:30 IST] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
   [2022-03-07, 12:19:32 IST] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   psycopg2.OperationalError: connection to server at "postgres-postgresql" (172.20.180.167), port 5432 failed: Connection refused
   	Is the server running on that host and accepting TCP/IP connections?
   
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 220, in heartbeat
       session.merge(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
       return self._merge(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
       merged = self.query(mapper.class_).get(key[1])
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
       return self._get_impl(ident, loading.load_on_pk_identity)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 1135, in _get_impl
       return db_load_fn(self, primary_key_identity)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 286, in load_on_pk_identity
       return q.one()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3490, in one
       ret = self.one_or_none()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3459, in one_or_none
       ret = list(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
       return self._execute_and_instances(context)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3556, in _execute_and_instances
       conn = self._get_bind_args(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3571, in _get_bind_args
       return fn(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
       conn = self.session.connection(**kw)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1142, in connection
       return self._connection_for_bind(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
       return self.transaction._connection_for_bind(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
       conn = bind._contextual_connect()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
       self._wrap_pool_connect(self.pool.connect, None),
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
       Connection._handle_dbapi_exception_noconnection(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
       util.raise_(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
       return fn()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
       return _ConnectionFairy._checkout(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
       fairy = _ConnectionRecord.checkout(pool)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
       rec = pool._do_get()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
       return self._create_connection()
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
       return _ConnectionRecord(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
       self.__connect(first_connect_check=True)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
       pool.logger.debug("Error on connect(): %s", e)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
       compat.raise_(
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
       raise exception
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
       connection = pool._invoke_creator(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
       return dialect.connect(*cargs, **cparams)
     File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
       return self.dbapi.connect(*cargs, **cparams)
     File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
       conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
   sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "postgres-postgresql" (172.20.180.167), port 5432 failed: Connection refused
   	Is the server running on that host and accepting TCP/IP connections?
   ```
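   
   For context only: the failure above is a transient `Connection refused` raised while the Postgres pod is being rescheduled. The sketch below is not Airflow's heartbeat code; it is a minimal illustration, assuming SQLAlchemy and psycopg2 are installed and using a hypothetical DSN pointing at the `postgres-postgresql` service, of how such a transient failure can be retried until the pod comes back:
   
   ```python
   import time
   
   from sqlalchemy import create_engine, text
   from sqlalchemy.exc import OperationalError
   
   # Hypothetical DSN, matching the service name seen in the traceback above.
   DSN = "postgresql+psycopg2://airflow:airflow@postgres-postgresql:5432/airflow"
   
   # pool_pre_ping makes SQLAlchemy test a pooled connection before reusing it,
   # so connections left stale by a Postgres restart are discarded instead of failing.
   engine = create_engine(DSN, pool_pre_ping=True)
   
   def check_db_with_retry(retries=5, delay=5):
       """Retry a trivial query while the Postgres pod is being rescheduled."""
       for attempt in range(1, retries + 1):
           try:
               with engine.connect() as conn:
                   conn.execute(text("SELECT 1"))
               return True
           except OperationalError as exc:
               print(f"attempt {attempt}/{retries} failed: {exc}")
               time.sleep(delay)
       return False
   
   if __name__ == "__main__":
       print("database reachable:", check_db_with_retry())
   ```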
   
   
   
   ### What you expected to happen
   
   _No response_
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   The problem occurs mostly when there is a heavy load on the k8s cluster: many jobs running at once cause Kubernetes to scale nodes up and down, which restarts the Postgres pod, and the Airflow job is then marked as failed because it cannot connect to Postgres during that window.
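   
   One possible mitigation sketch, assuming the cluster autoscaler is what evicts the Postgres pod during scale-down and using a hypothetical pod name and namespace: annotate the pod with `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` via the official Kubernetes Python client. In practice this annotation would normally be set in the chart's pod annotations rather than patched at runtime; the snippet only illustrates the idea.
   
   ```python
   from kubernetes import client, config
   
   # Hypothetical names; adjust to the actual release and namespace.
   POD_NAME = "postgres-postgresql-0"
   NAMESPACE = "airflow"
   
   def mark_pod_not_safe_to_evict():
       """Ask the cluster autoscaler not to evict the Postgres pod on scale-down."""
       config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
       v1 = client.CoreV1Api()
       patch = {
           "metadata": {
               "annotations": {
                   "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
               }
           }
       }
       v1.patch_namespaced_pod(name=POD_NAME, namespace=NAMESPACE, body=patch)
   
   if __name__ == "__main__":
       mark_pod_not_safe_to_evict()
   ```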
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

