You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/08 09:02:16 UTC
[GitHub] [airflow] Amey2400 opened a new issue #22074: DAG is marked as failed due to postgresql pod restart in k8s
Amey2400 opened a new issue #22074:
URL: https://github.com/apache/airflow/issues/22074
### Apache Airflow version
2.2.3
### What happened
We are using Airflow with KubernetesExecutor along with Postgres. Airflow and Postgres are deployed in the same namespace of the Kubernetes environment.
During the execution, Postgres pods are getting restarted during Kubernetes scale up and scale down. Due to this, the airflow is not able to connect to Postgres and marking the running job as failed.
Below is the error message we received during execution:
```
[2022-03-07, 12:19:30 IST] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2022-03-07, 12:19:32 IST] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
return fn()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
rec = pool._do_get()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
return self._create_connection()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
return _ConnectionRecord(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
self.__connect(first_connect_check=True)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.raise_(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
connection = pool._invoke_creator(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "postgres-postgresql" (172.20.180.167), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 220, in heartbeat
session.merge(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
return self._merge(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
merged = self.query(mapper.class_).get(key[1])
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
return self._get_impl(ident, loading.load_on_pk_identity)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 1135, in _get_impl
return db_load_fn(self, primary_key_identity)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 286, in load_on_pk_identity
return q.one()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3490, in one
ret = self.one_or_none()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3459, in one_or_none
ret = list(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
return self._execute_and_instances(context)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3556, in _execute_and_instances
conn = self._get_bind_args(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3571, in _get_bind_args
return fn(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
conn = self.session.connection(**kw)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1142, in connection
return self._connection_for_bind(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
return self.transaction._connection_for_bind(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
conn = bind._contextual_connect()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
self._wrap_pool_connect(self.pool.connect, None),
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
Connection._handle_dbapi_exception_noconnection(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
util.raise_(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
return fn()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
rec = pool._do_get()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
return self._create_connection()
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
return _ConnectionRecord(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
self.__connect(first_connect_check=True)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.raise_(
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
connection = pool._invoke_creator(self)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "postgres-postgresql" (172.20.180.167), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
```
### What you expected to happen
_No response_
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else
The problem occurs mostly when there is a huge load in the k8s cluster when a lot of jobs are running which causes Kubernetes pod to scale up and scale down which leads to the restart of Postgres pod and causing the airflow job to be marked as failed as its not able to connect with Postgres during that time
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org