Posted to dev@airflow.apache.org by Ricky Shi <xi...@gmail.com> on 2020/08/16 00:03:49 UTC

Possible bug: Airflow frequently fail with AWS RDS backend when #tasks increases

Hi Everyone,

We encountered a very strange issue with Airflow using AWS RDS as the
backend. When the number of tasks is large enough (>60), Airflow fails with
the following error message (MySQL RDS backend):

sqlalchemy.exc.OperationalError:
(MySQLdb._exceptions.OperationalError) (2005, "Unknown MySQL server
host ... $AWS RDS address)

or (Postgres RDS backend):

psycopg2.OperationalError: could not translate host name $AWS RDS address
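
Both messages indicate that the client could not resolve the RDS host name:
MySQL client error 2005 and the "could not translate host name" error are
raised during name resolution, before any connection reaches the database.
A minimal sketch for checking resolution outside Airflow is shown below. The
function name probe_dns and its parameters are hypothetical, chosen only for
illustration, and the resolver argument exists so the check can be exercised
without network access.

```python
import socket
import time


def probe_dns(host, attempts=5, delay=0.0, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and return how many lookups failed."""
    failures = 0
    for _ in range(attempts):
        try:
            # Same kind of lookup a DB driver performs before connecting.
            resolver(host, None)
        except socket.gaierror:
            # Name resolution failed (the symptom reported above).
            failures += 1
        time.sleep(delay)
    return failures
```

Running probe_dns with the RDS endpoint as host from the Airflow machine
while the errors occur would show whether lookups fail independently of
Airflow. A non-zero result points at name resolution rather than the
database itself.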


When we restart Airflow, the problem goes away, and the job scheduler and
website both run fine. However, after a couple of days of smooth running,
it fails again with the same error message.

We found that other people on Stack Overflow are experiencing the same
issue, but no solution has been posted. Does anyone know how to resolve it?

Thanks,

-- 
Ricky Shi

Re: Possible bug: Airflow frequently fail with AWS RDS backend when #tasks increases

Posted by Ricky Shi <xi...@gmail.com>.
Thanks Brian. Your explanation does make sense and fits the symptom. What
did you do to fix the issue?



On Sat, Aug 15, 2020 at 8:23 PM Brian Greene <
brian@heisenbergwoodworking.com> wrote:

> When I had a similar issue, it turned out that the tasks were written in
> a way that made them RAPIDLY open a large number of new RDS connections.
>
> AWS RDS, particularly if you're using the cluster endpoint, requires a
> DNS lookup (4 hops, if I recall correctly) before your connection request
> actually resolves to a real host. This lookup is throttled, and after a
> certain number of hits in a short time, it returns the error above (which
> is annoying, as it makes it look like the DB just 'vanishes' from time to
> time).
>
> Brian
>


-- 
Ricky Shi

Re: Possible bug: Airflow frequently fail with AWS RDS backend when #tasks increases

Posted by Brian Greene <br...@heisenbergwoodworking.com>.
When I had a similar issue, it turned out that the tasks were written in a
way that made them RAPIDLY open a large number of new RDS connections.

AWS RDS, particularly if you're using the cluster endpoint, requires a DNS
lookup (4 hops, if I recall correctly) before your connection request
actually resolves to a real host. This lookup is throttled, and after a
certain number of hits in a short time, it returns the error above (which
is annoying, as it makes it look like the DB just 'vanishes' from time to
time).
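
If throttled lookups are indeed the cause, one general way to cope is to
treat the failure as transient and retry with a growing delay. The sketch
below is a generic illustration under that assumption, not the fix that was
actually applied in this case. The name connect_with_backoff and all of its
parameters are hypothetical, and `connect` stands for any zero-argument
callable that opens a database connection.

```python
import random
import time


def connect_with_backoff(connect, retry_on=(OSError,), attempts=5,
                         base_delay=0.5, sleep=time.sleep):
    """Call `connect()` until it succeeds, backing off between failures.

    `retry_on` lists the exception types treated as transient; the default
    is only a placeholder and should match the driver in use.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # Exponential backoff with jitter, so that many tasks do not
            # all retry (and repeat the lookup) at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

With psycopg2, for example, retry_on would be
(psycopg2.OperationalError,). Retrying only hides the symptom; opening
fewer new connections in the first place, for instance by reusing them
within a task, reduces the number of lookups.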

Brian
