Posted to dev@airflow.apache.org by Alex Wenckus <al...@mainstreethub.com> on 2017/06/29 17:20:29 UTC

Issues when worker loses database connectivity

We are experiencing issues with the celery executor on Airflow 1.8.1
when a worker loses connectivity to the database. We see the following
exception logged:

OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server
has gone away') [SQL: u'SELECT celery_taskmeta.id AS celery_taskmeta_id,
celery_taskmeta.task_id AS celery_taskmeta_task_id, celery_taskmeta.status
AS celery_taskmeta_status, celery_taskmeta.result AS
celery_taskmeta_result, celery_taskmeta.date_done AS
celery_taskmeta_date_done, celery_taskmeta.traceback AS
celery_taskmeta_traceback \nFROM celery_taskmeta \nWHERE
celery_taskmeta.task_id = %s'] [parameters:
('cdda87c5-03f5-471e-b537-0e22ce432756',)]

[2017-06-17 10:01:03,084: WARNING/PoolWorker-8] Failed operation
_store_result.  Retrying 2 more times.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/backends/database/__init__.py", line 53, in _inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/celery/backends/database/__init__.py", line 107, in _store_result
    task = list(session.query(Task).filter(Task.task_id == task_id))
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2878, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/local/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib64/python2.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
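For context on the "Retrying 2 more times" line: the _inner frame at the
top of the trace is a retry wrapper around the result-store operation.
The general shape of such a wrapper is roughly the following. This is
only an illustrative sketch based on our reading of the trace, not
celery's actual implementation; the names retry and flaky are ours:

```python
import functools


class OperationalError(Exception):
    """Stand-in for the DB-API error seen above (illustrative)."""


def retry(max_retries=3):
    """Illustrative retry-wrapper sketch; not celery's real code."""
    def deco(fun):
        @functools.wraps(fun)
        def _inner(*args, **kwargs):
            last_exc = None
            for _ in range(max_retries):
                try:
                    return fun(*args, **kwargs)
                except OperationalError as exc:
                    # If every attempt reuses the same dead pooled
                    # connection, retrying alone never recovers.
                    last_exc = exc
            raise last_exc
        return _inner
    return deco


calls = []


@retry(max_retries=3)
def flaky():
    """Fails twice, then succeeds, to exercise the wrapper."""
    calls.append(1)
    if len(calls) < 3:
        raise OperationalError("MySQL server has gone away")
    return "stored"


value = flaky()
print(value)  # prints "stored" after two failed attempts
```

Which suggests to us that the retries are exhausted against the same
broken connection rather than a fresh one.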

We are running our database on Amazon RDS, and this occurred during our
maintenance window; the replacement database was available almost
immediately (with a new IP address). When this happens the worker
appears to become deadlocked and no longer processes work.

Are there any settings we can update which will help mitigate this issue?
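One thing we wondered about is whether a pool-level setting would help,
since the error looks like a stale pooled connection being reused after
MySQL dropped it. For example, SQLAlchemy's pool_recycle discards any
pooled connection older than a threshold so it is replaced before the
server's wait_timeout can kill it. A minimal sketch of the idea (the
sqlite:// URL is only to keep the snippet self-contained; the real
engine would use our mysql:// RDS URL, and 1800 is an illustrative
value):

```python
from sqlalchemy import create_engine, text

# pool_recycle=1800: connections idle in the pool longer than 1800
# seconds are closed and reopened instead of being handed to a query.
# The threshold should sit below the MySQL server's wait_timeout.
engine = create_engine("sqlite://", pool_recycle=1800)

with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()

print(result)  # prints 1
```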
As a workaround we have a script that detects when the queued task
count has stayed above 5 while the active count has been 0 for 10
minutes or longer, but this presents its own issues.

Thanks!

Alex