Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/18 22:58:09 UTC

[GitHub] [airflow] taylorfinnell edited a comment on issue #18011: Task stuck in upstream_failed

taylorfinnell edited a comment on issue #18011:
URL: https://github.com/apache/airflow/issues/18011#issuecomment-922384877


   Hi @ephraimbuddy - I work with @WattsInABox. We don't see `FATAL: sorry, too many clients already.` but we do see:
   
   ```
   Traceback (most recent call last):
     File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
       session.merge(self)
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
       return self._merge(
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
       merged = self.query(mapper.class_).get(key[1])
     File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
       return self._get_impl(ident, loading.load_on_pk_identity)
   
   ....
   
   psycopg2.OperationalError: could not connect to server: Connection timed out
   ```
   
   This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks now retry since we have #16301, and they eventually succeed, but sometimes a task is SIGTERM'ed 5 or more times before succeeding - which is not ideal for tasks that take an hour or more. I also suspect that at times this results in downstream tasks being set to upstream_failed even though all the upstream tasks actually succeeded - but I can't prove it.
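   
   For context, a task set up to retry after being killed looks roughly like the sketch below (the operator, callable, dag_id, and values here are illustrative, not our exact DAG):
   
   ```
   from datetime import datetime, timedelta
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   
   # Illustrative only - shows the retry settings that let a SIGTERM'ed task
   # run again instead of failing outright.
   default_args = {
       "retries": 5,                        # allow several retries after SIGTERM
       "retry_delay": timedelta(minutes=5),
   }
   
   with DAG(
       dag_id="example_long_running",       # hypothetical dag_id
       default_args=default_args,
       schedule_interval=None,
       start_date=datetime(2021, 1, 1),
   ) as dag:
       long_task = PythonOperator(
           task_id="long_task",
           python_callable=lambda: None,    # placeholder for the hour-long work
       )
   ```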
   
   We tried bumping `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to `60` to ease up on how often we hit the database, with no luck. The error also happens when only a couple of DAGs are running, so there is not much load on our nodes or on the database, and we don't think it's a networking issue.
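   
   A quick way to confirm the override is actually picked up inside a running worker or scheduler container (assuming shell access to the pod) is a sketch like this:
   
   ```
   # Sanity check: print the heartbeat interval the running process actually sees.
   # Run inside the worker/scheduler environment; expect 60 after the override.
   from airflow.configuration import conf
   
   print(conf.getint("scheduler", "job_heartbeat_sec"))
   ```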
   
   Our SQLAlchemy pool size is 350, which might be high - but my understanding is that the pool does not create connections until they are needed, and according to AWS monitoring the maximum connection count we ever hit at peak is ~300-370, which should be manageable on our `db.m6g.4xlarge` instance. However, if each worker gets its own 350-connection pool and each worker opens many connections that then stay alive in the pool, perhaps we are exhausting Postgres memory.
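   
   To test that theory, a sketch like the one below could show how many connections each worker host is actually holding open on the database side (the DSN is a placeholder, and grouping by `client_addr` assumes workers connect from distinct hosts):
   
   ```
   # Minimal sketch (not production code): count open connections to the
   # metadata DB grouped by client host and state, using psycopg2.
   import psycopg2
   
   # Placeholder DSN - replace with the real metadata DB endpoint and credentials.
   conn = psycopg2.connect("host=METADATA_DB_HOST dbname=airflow user=airflow password=REDACTED")
   try:
       with conn.cursor() as cur:
           cur.execute(
               """
               SELECT client_addr, state, count(*)
               FROM pg_stat_activity
               WHERE datname = current_database()
               GROUP BY client_addr, state
               ORDER BY count(*) DESC
               """
           )
           for client_addr, state, n in cur.fetchall():
               print(client_addr, state, n)
   finally:
       conn.close()
   ```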
   
   Do you have any additional advice on things to try? 

