Posted to dev@airflow.apache.org by Jarek Potiuk <ja...@potiuk.com> on 2021/07/21 06:48:16 UTC

Re: airflow issue when intermittent mysql connection drops

I think Dev@ is a better place (added). I understand the frustration; those
kinds of errors are the worst - it's like Schrödinger's cat: neither dead
nor alive until you look at it.

My personal view is that whenever situations like this happen, the software
should crash hard immediately. You can spend a lot of debugging, frustration
and engineering effort trying to work around this kind of situation and
recover from it, but there will always be edge cases you won't think of -
crashing the software hard in such a case is much better, because your
deployment needs to handle restarts anyway, and starting "clean" is much
better than trying to clean up while running. Especially since most
"serious" deployments have some redundancy - in our case you can already run
multiple schedulers, multiple workers and multiple webservers, so restarting
any of them is not a problem. Recovery can (and usually will) be handled at
the higher "deployment" level - docker compose, K8S or custom scripts should
restart such a failed component.

Could you please share with us the errors that are printed in such cases in
the Airflow logs - ideally "webserver", "scheduler" and "worker" if you
happen to run Celery? If we see what's going on, we can investigate why you
have this "hanging" case and implement "crash hard" there. If you could open
a GitHub issue with all the details (cc me - @potiuk - when you do) at
https://github.com/apache/airflow, I am happy to take a look at it. However,
I am a bit surprised this happens; my belief is that Airflow WILL crash hard
on metadata DB access problems. The problem might be that Airflow is simply
unaware that the connection to the DB is not working.

There might be another case - and it might result from the way the Galera
cluster proxy works. This might actually be a matter of timeout
configuration in MySQL. If you cannot see any errors in the Airflow logs, I
think you might have a case where either the connection from Airflow simply
stays in the "opening" state for a long time, or an already established
connection is simply not killed by the proxy. In that case this is really a
question of bad configuration of:

a) the proxy configuration - when doing failover, the proxy should either
transparently move open and half-established connections, or kill them. If
they are kept running, the client will "think" that the connection is still
alive, send queries there, and possibly wait for answers for quite some
time. I do not know Galera well, but I am sure it has some flexibility
there, and maybe there are some options you can change.
b) the MySQL server configuration - the client can use various techniques to
determine whether the server is up, and there are various timeouts you can
configure (
https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py
):
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 10       |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 30       |
| net_write_timeout          | 60       |
| slave_net_timeout          | 3600     |
| wait_timeout               | 28800    |
+----------------------------+----------+
However, I think this configuration should have limited impact - it might
speed up the actual failover done by the proxy, but I think it will not help
with the client-side hang on its own.
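
For illustration, here is a minimal sketch of such client-side timeouts,
assuming the pymysql driver and hypothetical connection details (host,
credentials and values are placeholders, not a recommendation). It bounds
how long the client can hang on a dead node and prints the server-side
timeout variables discussed above:

import pymysql

# Client-side timeouts: fail fast instead of hanging on a dead/stuck node.
conn = pymysql.connect(
    host="haproxy.example.com",  # placeholder for your haproxy endpoint
    user="airflow",
    password="secret",           # placeholder
    database="airflow",
    connect_timeout=5,           # give up quickly if the handshake stalls
    read_timeout=30,             # bound how long a query may block
    write_timeout=30,
)
try:
    # Inspect what the server (via the proxy) actually reports.
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES LIKE '%timeout%'")
        for name, value in cur.fetchall():
            print(name, value)
finally:
    conn.close()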

c) Finally (and THIS is probably what can help you immediately) - you can
fine-tune the client configuration. In Airflow you can configure various
SQLAlchemy parameters to better handle your deployment:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
(just look for all parameters starting with sql_alchemy). We use SQLAlchemy
to connect to the metadata DB, and it has everything you need to fine-tune
your configuration - for example, to set up timeouts for different
situations. In your case you should probably configure
`sql_alchemy_connect_args`:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
- you will find links in our docs to the SQLAlchemy documentation that you
can follow for examples. This is a simple dictionary of extra parameters
passed to the SQLAlchemy engine initialization. Most likely you simply need
to provide a client-controlled timeout on establishing the connection, on
running a query, or both. These parameters depend on the dialect used
(MySQL/Postgres), and the capabilities also differ depending on which
library you use to connect to MySQL (the available libraries are listed
here: https://docs.sqlalchemy.org/en/14/core/engines.html#mysql), and each
of them has different parameters, so you need to check which ones apply to
your library. However, I think one of { "timeout": N } or
{ "connect_timeout": N } should work in all of them.
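
To make this concrete, here is a minimal sketch, assuming Airflow 2.x where
`sql_alchemy_connect_args` takes the import path of a dictionary, and a
pymysql/mysqlclient-style driver. The module name and values are examples
only; check your driver's documentation for the exact keys it accepts:

# my_airflow_local_settings.py - any module importable by Airflow will do;
# the file name here is just an example.
# Keys are driver-specific; these are understood by pymysql/mysqlclient.
CONNECT_ARGS = {
    "connect_timeout": 5,  # seconds allowed for establishing a connection
    "read_timeout": 30,    # seconds a query may block before raising an error
}

# Then point Airflow at that dictionary in airflow.cfg:
#
#   [core]
#   sql_alchemy_connect_args = my_airflow_local_settings.CONNECT_ARGS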

There is also one other parameter that might help:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
- it defaults to "True", so maybe you have it disabled and that's the root
cause. This parameter makes the pool test each connection with a real
round-trip to the database before handing it out, to make sure the server is
responding. It will help in case your proxy accepts the connection but - for
whatever reason - the connection is stuck. Maybe that's the problem you
have.
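
Under the hood this maps to SQLAlchemy's `pool_pre_ping` engine flag; a
minimal standalone sketch of the same behaviour (the connection URL is a
placeholder) looks like this:

from sqlalchemy import create_engine, text

# pool_pre_ping issues a cheap liveness check (roughly a "SELECT 1") every
# time a pooled connection is checked out; stale connections are discarded
# and transparently replaced instead of handing a dead socket to the caller.
engine = create_engine(
    "mysql+pymysql://airflow:secret@haproxy.example.com/airflow",  # placeholder
    pool_pre_ping=True,
    pool_recycle=1800,  # also recycle connections well below wait_timeout
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())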

Just to summarize: I think that looking at how your proxy behaves, plus some
simple fine-tuning of the Airflow SQLAlchemy configuration, might help
(especially if you do not see any obvious errors while you observe the
"hangs"). However, if you see errors in the Airflow logs that do not result
in Airflow crashing - please let us know via an issue and we will take a
look at it.

J.


On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <ro...@thunderbyte.ai>
wrote:

> Hi!
>
> Not sure if this is the proper place for this question; if not, please let
> me know.
>
> We're running Airflow on a MariaDB/Galera cluster, and we're using haproxy
> to provide HA connections. Sometimes (mostly due to maintenance) one node
> is temporarily unavailable, which forces haproxy to drop connections to
> this node, after which new connections are passed to another (still
> running) node. This is quite common, and we use it for other software too.
> See https://galeracluster.com/library/documentation/ha-proxy.html for more
> info.
>
> The issue we're running into, however, is that Airflow gets lost in this
> situation, which to Airflow looks like a dropped connection. Airflow
> services seem to be running (they themselves think they are running) but
> they're just stuck. So they don't do anything, but we don't know it.
>
> Aiming at an HA setup, this is definitely not what we want. Colleagues are
> actually now at the point of disqualifying Airflow.
>
> Of course I can provide more details if needed, but I'd like to know first
> if this is the right place to bring this up.
>
> Best,
>
> Rolf
>


-- 
+48 660 796 129

Re: airflow issue when intermittent mysql connection drops

Posted by Jarek Potiuk <ja...@potiuk.com>.
Good luck. Let us know how it goes!

On Wed, Jul 21, 2021 at 10:49 AM Rolf Fokkens <ro...@thunderbyte.ai>
wrote:

> I'm always interested in better alternatives, thanks for the advice!
>
> For all kinds of historical reasons we are using a lot of MariaDB/Galera.
> In particular the scalability provided by Galera (active-active clusters)
> is of value, and has no equivalent alternative I think. And the headaches
> ... not really so far.
>
> For Airflow we could consider Postgres though; will look into that.

-- 
+48 660 796 129


Re: airflow issue when intermittent mysql connection drops

Posted by Rolf Fokkens <ro...@thunderbyte.ai>.
I'm always interested in better alternatives, thanks for the advice!

For all kinds of historical reasons we are using a lot of MariaDB/Galera. In particular the scalability provided by Galera (active-active clusters) is of value, and has no equivalent alternative I think. And the headaches ... not really so far.

For Airflow we could consider Postgres though; will look into that.

On Wed, 2021-07-21 at 10:18 +0200, Jarek Potiuk wrote:
And one more comment - if you are still at the "choose database" stage Rolf, I HIGHLY, HIGHLY recommend Postgres.

Postgres is ROCK SOLID. The number of problems (especially in big installations) that users raise with Postgres compared to MySQL is basically "ZERO". With MySQL, things like locking, HA and database encoding problems are a continuous PAIN. It's such a pain that even Google's Cloud Composer team took the difficult decision to migrate to Postgres for Airflow 2 even though they used MySQL in 1.10.
This is mostly anecdotal, but some of our committers with years of experience mentioned that they've never, ever had to worry about Postgres. I've even seen a tweet recently - "Postgres is better Mongo than Mongo" - and I heartily agree with it.

So in short - choose Postgres if you can. It will save you a lot of headaches, especially if you want a "serious" installation. It is also recommended to use PgBouncer with Airflow (it's part of the official Helm chart - https://airflow.apache.org/docs/helm-chart/stable/index.html ) and I think that can make setting up HA even easier.

J.



Re: airflow issue when intermittent mysql connection drops

Posted by Jarek Potiuk <ja...@potiuk.com>.
And one more comment - if you are still at the "choose database" stage
Rolf, I HIGHLY, HIGHLY recommend Postgres.

Postgres is ROCK SOLID. The number of problems (especially in big
installations) that users raise with Postgres compared to MySQL is
basically "ZERO". With MySQL, things like locking, HA and database encoding
problems are a continuous PAIN. It's such a pain that even Google's Cloud
Composer team took the difficult decision to migrate to Postgres for
Airflow 2 even though they used MySQL in 1.10.
This is mostly anecdotal, but some of our committers with years of
experience mentioned that they've never, ever had to worry about Postgres.
I've even seen a tweet recently - "Postgres is better Mongo than Mongo" -
and I heartily agree with it.

So in short - choose Postgres if you can. It will save you a lot of
headaches, especially if you want a "serious" installation. It is also
recommended to use PgBouncer with Airflow (it's part of the official Helm
chart - https://airflow.apache.org/docs/helm-chart/stable/index.html ) and
I think that can make setting up HA even easier.

J.


Re: airflow issue when intermittent mysql connection drops

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. There are various HA solutions (Active-Passive) for both Postgres and
MySQL. I think they are mostly commercial offerings (but hey, if you want HA
you can afford it ;) ).
BTW, we do not support an Active/Active HA scenario (yet). It is mostly not
needed (the DB is rarely a bottleneck even in huge installations), and we
rely heavily on modern locking solutions for Scheduler HA, so for now we can
just confirm that an Active/Passive HA setup for the DB is the way to go.

J.


On Wed, Jul 21, 2021 at 9:09 AM Rolf Fokkens <ro...@thunderbyte.ai>
wrote:

> Thanks for the info.
>
> As mentioned we aim for high availability, and MariaDB/Galera provide an
> HA installation of a relational database. Do the supported alternatives
> provide HA?
>
> I'll look into the matrix; will get back on that.

-- 
+48 660 796 129


Re: airflow issue when intermittent mysql connection drops

Posted by Rolf Fokkens <ro...@thunderbyte.ai>.
Thanks for the info.

As mentioned we aim for high availability, and MariaDB/Galera provide an HA installation of a relational database. Do the supported alternatives provide HA?

I'll look into the matrix; will get back on that.

On Wed, 2021-07-21 at 08:59 +0200, Jarek Potiuk wrote:
BTW. Just to add on top of that - officially we do not have MariaDB support. We only support Postgres, MySQL and, experimentally, MSSQL in main. So while MariaDB is mostly compatible with MySQL, it's your decision to use it and you are quite a bit on your own when it comes to MariaDB. In fact we even discourage people from using MariaDB (https://github.com/apache/airflow#requirements) because we do not run tests with any version of MariaDB in our CI.

However - if you want to stick with MariaDB and are committed to it - maybe you could contribute and add support for MariaDB in our CI?
Very recently one of our users (now a committer), Aneesh, added support for MSSQL (https://github.com/apache/airflow/pull/9973), and if you would like to follow in his footsteps, adding MariaDB to the test matrix should not be too difficult.

J.



Re: airflow issue when intermittent mysql connection drops

Posted by Jarek Potiuk <ja...@potiuk.com>.
BTW. Just to add on top of that - officially we do not have MariaDB
support. We only support Postgres, MySQL and, experimentally, MSSQL in
main. So while MariaDB is mostly compatible with MySQL, it's your decision
to use it and you are quite a bit on your own when it comes to MariaDB. In
fact we even discourage people from using MariaDB
(https://github.com/apache/airflow#requirements) because we do not run
tests with any version of MariaDB in our CI.

However - if you want to stick with MariaDB and are committed to it - maybe
you could contribute and add support for MariaDB in our CI?
Very recently one of our users (now a committer), Aneesh, added support for
MSSQL (https://github.com/apache/airflow/pull/9973), and if you would like
to follow in his footsteps, adding MariaDB to the test matrix should not be
too difficult.

J.


On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think Dev@ is a better place (added). I understand the frustration,
> those kind of errors are the worst - It's like the Shroedinger's cat -
> neither dead nor alive until you look at it.
>
> My personal view is that whenever situations like this happen, the
> software should crash hard immediately. You save a lot of debugging,
> frustration and engineering powers in order to try to workaround this kind
> of situations and try to recover, but there will always be edge cases that
> you won't think about  - crashing the software hard in such case is much
> better, because in your deployment you need to handle restarts anyway, and
> starting "clean" is much better than trying to clean-up while you are
> running. Especially with most of the "serious" deployment you have certain
> redundancy - in our case we already can have multiple schedulers, multiple
> workers and multiple webservers, so restarting  either is not a problem.
> Then recovery can (and usually will be) handled at the higher "deployment"
> level - either docker compose. or K8S or custom scripts should restart such
> a failed component.
>
> Could you please share with us errors that are printed in such cases in
> the logs of airflow - ideally "webserver", "scheduler", "worker" if you
> happen to run Celery ? I think if we see what's going on we can investigate
> why you have this "hanging" case and implement "crash hard" there. If you
> could open a GitHub Issue with all the details there (cc: me - @potiuk when
> you do) https://github.com/apache/airflow - I am happy to take a look at
> that. However I am a bit surprised it happens, my belief is that airflow
> WILL crash hard on metadata db access problem. The problem might be if
> Airflow is also unaware that the connection to DB is not working.
>
> There might be another case - and it might result from the way galera
> cluster proxy works. This actually might be a configuration of timeouts in
> MySQL. In case you cannot see any logs in airflow indicating errors, I
> think you might have the case that either connection from airflow is simply
> in "opening" state for a long time, or already established connection is
> simply not killed by the proxy. In this case this is really the question of
> bad configuration of
>
> a) the proxy configuration - the proxy, when doing failover, should either
> transparently move the open/being established connection or kill them. If
> they kept running. the client will "think" that the connection is still
> alive and send queries there and possibly wait for answers for quite some
> time. I do not know Galera but I am sure they have some flexibility there
> and maybe there are some options you can change
> b) the mysql server configuration - the client can use various
> techniques to determine whether the server is up, and there are various
> timeouts you can configure (see
> https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py
> ):
> +----------------------------+----------+
> | Variable_name              | Value    |
> +----------------------------+----------+
> | connect_timeout            | 10       |
> | delayed_insert_timeout     | 300      |
> | innodb_lock_wait_timeout   | 50       |
> | innodb_rollback_on_timeout | OFF      |
> | interactive_timeout        | 28800    |
> | lock_wait_timeout          | 31536000 |
> | net_read_timeout           | 30       |
> | net_write_timeout          | 60       |
> | slave_net_timeout          | 3600     |
> | wait_timeout               | 28800    |
> +----------------------------+----------+
> However, I think this configuration will have limited impact - it might
> speed up the actual failover done by the proxy, but I think it will not
> help with the hangs on its own.
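>
> For example, to inspect and lower those server-side timeouts
> (illustrative values only - pick ones that match your failover window,
> and note that SET GLOBAL needs the appropriate privileges and applies
> to new connections):
>
>   SHOW VARIABLES LIKE '%timeout%';
>   -- make idle/stuck connections die faster than the 8-hour default
>   SET GLOBAL wait_timeout = 600;
>   SET GLOBAL interactive_timeout = 600;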
>
> c) Finally (and THIS is probably what can help you immediately) - you
> can fine-tune the client configuration. In Airflow you can configure
> various SQLAlchemy parameters to better handle your deployment:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
> (just look for all parameters starting with sqlalchemy). We use
> SQLAlchemy to connect to the metadata DB and it has everything you need
> to fine-tune your configuration and - for example - set up timeouts for
> different situations. In your case you should probably configure
> `sql_alchemy_connect_args`:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
> - you will find links in our docs to the SQLAlchemy documentation that
> you can follow for examples. This is a simple dictionary of extra
> parameters that is passed to SQLAlchemy engine initialization. Most
> likely you simply need to provide a client-controlled timeout on
> establishing the connection, on running a query, or both. Those
> parameters depend on the dialect used (MySQL/Postgres), and the
> available options also differ depending on which library you use to
> connect to MySQL (the available libraries are listed here:
> https://docs.sqlalchemy.org/en/14/core/engines.html#mysql). Each of
> them has different parameters, so you need to check which ones are
> right for your library. However, I think one of { "timeout": N } or
> { "connect_timeout": N } should work in all of them.
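>
> A sketch of what that could look like (untested - the module name is
> just an example, and the exact keys depend on your driver;
> "connect_timeout" and "read_timeout" below are what mysqlclient/pymysql
> accept, so check your driver's docs):
>
>   # e.g. in a module on PYTHONPATH, say airflow_local_settings.py
>   CONNECT_ARGS = {
>       "connect_timeout": 10,  # fail fast when opening a connection
>       "read_timeout": 30,     # fail fast when a query gets no answer
>   }
>
>   # and in airflow.cfg point the option at that dictionary:
>   # [core]
>   # sql_alchemy_connect_args = airflow_local_settings.CONNECT_ARGS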
>
> There is also one other parameter that might help:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
> - it defaults to "True", so maybe you have it disabled and that's the
> root cause. This parameter makes SQLAlchemy perform a test round-trip
> to the database every time a connection is taken from the pool, to make
> sure the server is actually responding. It will help in the case where
> your proxy accepts connections but - for whatever reason - they are
> stuck. Maybe that's the problem you have.
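>
> i.e. make sure your airflow.cfg (or the corresponding
> AIRFLOW__CORE__SQL_ALCHEMY_POOL_PRE_PING environment variable) keeps
> it at the default:
>
>   [core]
>   sql_alchemy_pool_pre_ping = True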
>
> Just to summarize: I think that looking at how your proxy behaves and
> some simple fine-tuning of the Airflow SQLAlchemy configuration might
> help (especially if you do not see any obvious errors while you observe
> the "hangs"). However, if you see errors in the Airflow logs that do
> not result in Airflow crashing - please let us know via a GitHub issue
> and we will take a look.
>
> J.
>
>
> On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <ro...@thunderbyte.ai>
> wrote:
>
>> Hi!
>>
>> Not sure if this is the proper place for this question; if not, please
>> let me know.
>>
>> We're running airflow on a mariadb/galera cluster, and we're using
>> haproxy to provide HA connections. Sometimes (mostly due to
>> maintenance) one node is temporarily unavailable, which forces haproxy
>> to drop connections to this node, after which new connections are
>> passed to another (still running) node. This is quite common, and we
>> use it for other software too. See
>> https://galeracluster.com/library/documentation/ha-proxy.html for more
>> info.
>>
>> The issue we're running into, however, is that airflow gets lost in
>> this situation, which to airflow looks something like a dropped
>> connection. The airflow services seem to be running (they themselves
>> think they are running) but they're just stuck. So they don't do
>> anything, and we don't know about it.
>>
>> Aiming at an HA setup, this is definitely not what we want. Colleagues
>> are actually now at the point of disqualifying airflow.
>>
>> Of course I can provide more details if needed, but I'd like to know
>> first if this is the right place to bring this up.
>>
>> Best,
>>
>> Rolf
>>
>
>
> --
> +48 660 796 129
>


-- 
+48 660 796 129
