Posted to dev@airflow.apache.org by Renaud Grisoni <re...@gmail.com> on 2016/09/30 15:07:27 UTC

Airflow bugs but stays running

Hi all,

I use Airflow v1.7.1.3 with the local scheduler, and I have run into a
problem with the scheduler:
For some reason, the Airflow database is no longer accessible, so the
scheduler displays the OperationalError below. My problem is that the
scheduler does not kill itself after this error: it keeps running, but it
no longer runs any DAGs. I cannot automatically restart it with Supervisor
because its process is always displayed as running. Each time I have a
network error, Airflow displays this error and enters this "zombie" mode,
and my DAGs are not processed.

Have you heard of this problem? Any suggestions?



29/09/2016 21:09:53Traceback (most recent call last):
29/09/2016 21:09:53  File "/usr/bin/airflow", line 15, in <module>
29/09/2016 21:09:53    args.func(args)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 455, in scheduler
29/09/2016 21:09:53    job.run()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/airflow/jobs.py", line 173, in run
29/09/2016 21:09:53    self._execute()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/airflow/jobs.py", line 712, in _execute
29/09/2016 21:09:53    paused_dag_ids = dagbag.paused_dags()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/airflow/models.py", line 429, in paused_dags
29/09/2016 21:09:53    DagModel.is_paused == True)]
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2761, in __iter__
29/09/2016 21:09:53    return self._execute_and_instances(context)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2774, in _execute_and_instances
29/09/2016 21:09:53    close_with_result=True)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2765, in _connection_from_session
29/09/2016 21:09:53    **kw)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 893, in connection
29/09/2016 21:09:53    execution_options=execution_options)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 898, in _connection_for_bind
29/09/2016 21:09:53    engine, execution_options)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 334, in _connection_for_bind
29/09/2016 21:09:53    conn = bind.contextual_connect()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2039, in contextual_connect
29/09/2016 21:09:53    self._wrap_pool_connect(self.pool.connect, None),
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2078, in _wrap_pool_connect
29/09/2016 21:09:53    e, dialect, self)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1405, in _handle_dbapi_exception_noconnection
29/09/2016 21:09:53    exc_info
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
29/09/2016 21:09:53    reraise(type(exception), exception, tb=exc_tb, cause=cause)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2074, in _wrap_pool_connect
29/09/2016 21:09:53    return fn()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 376, in connect
29/09/2016 21:09:53    return _ConnectionFairy._checkout(self)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 713, in _checkout
29/09/2016 21:09:53    fairy = _ConnectionRecord.checkout(pool)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 485, in checkout
29/09/2016 21:09:53    rec.checkin()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
29/09/2016 21:09:53    compat.reraise(exc_type, exc_value, exc_tb)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 482, in checkout
29/09/2016 21:09:53    dbapi_connection = rec.get_connection()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 563, in get_connection
29/09/2016 21:09:53    self.connection = self.__connect()
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/pool.py", line 607, in __connect
29/09/2016 21:09:53    connection = self.__pool._invoke_creator(self)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 97, in connect
29/09/2016 21:09:53    return dialect.connect(*cargs, **cparams)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 385, in connect
29/09/2016 21:09:53    return self.dbapi.connect(*cargs, **cparams)
29/09/2016 21:09:53  File "/usr/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
29/09/2016 21:09:53    conn = _connect(dsn, connection_factory=connection_factory, async=async)
29/09/2016 21:09:53sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "db-airflow" to address: Name does not resolve
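
The last line of the traceback points at host name resolution rather than at Postgres itself. A minimal sketch of how that condition could be checked outside Airflow, assuming Python 3 and only the standard library; the helper name can_resolve and its resolver parameter are hypothetical and not part of Airflow:

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address, else False.

    `resolver` is injectable only so the check can be exercised without
    touching the network; by default the system resolver is used.
    """
    try:
        return len(resolver(host, None)) > 0
    except socket.gaierror:
        # Resolution failures such as "Name does not resolve" surface
        # as socket.gaierror in Python.
        return False
```

Calling can_resolve("db-airflow") on the machine that runs the scheduler would show whether the name from the error message resolves at that moment.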

Re: Airflow bugs but stays running

Posted by siddharth anand <sa...@apache.org>.
... sent too soon...

...but more info is needed to reproduce this on our side. What version of
Postgres are you running, and what is your environment (e.g. cloud)?

On Sat, Oct 1, 2016 at 1:03 AM, siddharth anand <sa...@apache.org> wrote:

> Hi Renaud,
> I've never encountered this issue though I do run postgres & LocalExecutor
> and am running 1.7.1.3 in all of my environments.
>
> I'm running on master on my local dev machine. I changed the valid sql_alchemy_conn
> = postgresql://siddharth@localhost:5432/airflow to the invalid
> sql_alchemy_conn = postgresql://siddharth@localhost:5432/airflowaaaaa.
>
> When trying to start the scheduler and webserver, both exited immediately
> with
> sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:
> database "airflowaaaaa" does not exist
>
> It looks like your problem is that the scheduler keeps trying to
> reestablish a connection and expects the problem to be transient. Why would
> restarting the process via supervisord solve your problem? Also, isn't the
> flaky dns resolver issue your core concern? You can open a JIRA to track
> this, but more information is needed to
>
> -s

Re: Airflow bugs but stays running

Posted by siddharth anand <sa...@apache.org>.
Hi Renaud,
I've never encountered this issue, though I do run Postgres and LocalExecutor
and am running 1.7.1.3 in all of my environments.

I'm running on master on my local dev machine. I changed the valid
sql_alchemy_conn = postgresql://siddharth@localhost:5432/airflow
to the invalid
sql_alchemy_conn = postgresql://siddharth@localhost:5432/airflowaaaaa

When trying to start the scheduler and webserver, both exited immediately with
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:
database "airflowaaaaa" does not exist
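
Note that this test and the original report fail at different stages: here the host (localhost) resolves and Postgres rejects the database name, whereas in the original traceback the host name "db-airflow" could not be resolved at all. A small sketch, assuming Python 3 and the standard library only, that splits a sql_alchemy_conn value into the parts involved; the function name describe_conn is hypothetical:

```python
from urllib.parse import urlsplit


def describe_conn(sql_alchemy_conn):
    """Split a connection URL into the host, port and database name."""
    parts = urlsplit(sql_alchemy_conn)
    return {
        "host": parts.hostname,                # must resolve via DNS
        "port": parts.port,                    # must accept connections
        "database": parts.path.lstrip("/"),    # must exist on the server
    }
```

For the invalid URL above, the host and port are unchanged and only the database name differs from the valid one.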

It looks like your problem is that the scheduler keeps trying to
reestablish a connection and expects the problem to be transient. Why would
restarting the process via supervisord solve your problem? Also, isn't the
flaky DNS resolver issue your core concern? You can open a JIRA to track
this, but more information is needed to

-s
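
For illustration only, one general pattern for a long-running loop that should treat errors as transient for a while but eventually give up, so that a supervisor can see the process exit. This is a sketch under stated assumptions (Python 3, standard library); it is not Airflow's scheduler code, and run_until_repeated_failure, step, max_failures and delay are hypothetical names:

```python
import sys
import time


def run_until_repeated_failure(step, max_failures=3, delay=5.0, sleep=time.sleep):
    """Call step() in a loop; return after max_failures consecutive errors.

    A single success resets the failure streak, so short outages are
    tolerated, while a persistent outage ends the loop.
    """
    failures = 0
    while True:
        try:
            step()
            failures = 0  # success: the outage, if any, was transient
        except Exception as exc:
            failures += 1
            sys.stderr.write(
                "step failed (%d/%d): %s\n" % (failures, max_failures, exc)
            )
            if failures >= max_failures:
                return failures
        sleep(delay)
```

A wrapper script could call sys.exit(1) once this function returns, which would let a process manager such as Supervisor notice the exit and restart the process.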
