Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/24 19:03:42 UTC

[GitHub] [airflow] potiuk commented on issue #27100: Task gets stuck when the DB is unreachable

potiuk commented on issue #27100:
URL: https://github.com/apache/airflow/issues/27100#issuecomment-1289468566

   @karakanb can you please take a look at your history/monitoring to check whether any of the Airflow components (including pgbouncer) restarted around the time this happened? If so, can you please detail the restart events you saw? I am particularly interested in whether there was a scheduler restart. Did you attempt to restart the scheduler manually to fix the problem?
   
   From the logs you can see that there are multiple DAG file processor "fatal" errors, but no scheduler restart caused by the outage.
   
   > sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "my-db-instance.b.db.ondigitalocean.com" (10.110.0.17), port 25061 failed: FATAL:  pgbouncer cannot connect to server
   
   I think this one needs a look @dstandish @ephraimbuddy @ashb @uranusjr - I have seen other people report similar issues when there is a temporary problem with the database, and my gut feeling tells me this is the classic "zombie DB application" problem: the application kind of keeps working and holds on to its connections, and some transactions got "completed" status so the application "thinks" they succeeded, but the database failure prevented the changes from actually being flushed to disk.
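   
   To illustrate the "keeps connections" part - this is just a sketch with a placeholder DSN, not what Airflow configures today: SQLAlchemy's connection pool will happily hand out a pooled connection after a DB blip. `pool_pre_ping` at least detects dead connections at checkout, but it cannot detect a commit that the server acknowledged and then lost.
   
   ```python
   # Sketch only - the DSN below is a placeholder, not a real deployment setting.
   from sqlalchemy import create_engine
   
   engine = create_engine(
       "postgresql+psycopg2://airflow:***@pgbouncer:25061/airflow",  # placeholder DSN
       pool_pre_ping=True,   # run a lightweight "ping" before reusing a pooled connection
       pool_recycle=1800,    # proactively recycle connections older than 30 minutes
   )
   ```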
   
   Of course we cannot do much about it on the DB side or in a running Airflow, but I'd say we should hard-crash the scheduler whenever any of its processes or subprocesses gets a "FATAL" error like that. Airflow has a built-in mechanism to reconcile its state whenever it is restarted, and if the database still has problems, the scheduler will fail to start (and will keep being restarted until the DB is back). So if my guess is right, simply restarting the scheduler should eventually have fixed the problem.
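   
   To illustrate the "restarted until the DB is back" part: conceptually the restart loop only needs a startup check that exits non-zero while the database is unreachable. This is a sketch of that idea, not the actual Airflow startup code:
   
   ```python
   # Sketch of a fail-fast startup check - not the actual Airflow startup code.
   import sys
   
   from sqlalchemy import create_engine, text
   
   
   def check_db_or_die(sql_alchemy_conn: str) -> None:
       try:
           engine = create_engine(sql_alchemy_conn)
           with engine.connect() as conn:
               conn.execute(text("SELECT 1"))
       except Exception:
           # Exiting non-zero means the supervisor (systemd, k8s, ...) keeps
           # restarting the process until the database is reachable again.
           sys.exit(1)
   ```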
   
   If my guess is right, we can of course tell users to restart the scheduler in such cases, but this kind of error might go unnoticed by the user, so it would be much better if we detected such fatal errors and simply crashed the scheduler when they happen. That would make Airflow self-healing after such catastrophic events.
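   
   As a rough sketch of what "detect the fatal error and crash" could look like - not actual Airflow code, and the helper names (run_scheduler_loop, is_fatal_db_error) are made up for illustration:
   
   ```python
   # Rough sketch only - not actual Airflow code.
   import logging
   import sys
   
   from sqlalchemy.exc import OperationalError
   
   log = logging.getLogger(__name__)
   
   
   def is_fatal_db_error(exc: OperationalError) -> bool:
       # psycopg2 puts the server message (e.g. "FATAL: pgbouncer cannot
       # connect to server") into the exception text.
       return "FATAL" in str(exc)
   
   
   def guarded_scheduler_loop(run_scheduler_loop) -> None:
       try:
           run_scheduler_loop()
       except OperationalError as exc:
           if is_fatal_db_error(exc):
               log.exception("Fatal DB error - exiting so the scheduler can be restarted")
               sys.exit(1)  # let the process supervisor (systemd, k8s, ...) restart us
           raise
   ```
   
   With something like this in place, a FATAL from pgbouncer would turn into a scheduler exit, and the existing restart and state-reconciliation path would take over.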
   
   Let me know what you think.

