Posted to dev@airflow.apache.org by Andrew Maguire <an...@gmail.com> on 2017/10/25 10:42:54 UTC

Scheduler Keeps Crashing

Hi,

I'm not really sure exactly what is causing this, but I'm noticing lots
of dead periods on our Airflow server. Like this:

[image: image.png]

And it corresponds with no work actually being done when I look at the
database:

[image: image.png]
When I look in supervisor I typically see:

[image: image.png]
So the Airflow scheduler seems to keep being restarted by supervisor.

Then if I look in a recent stderr log from the scheduler I see:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py",
line 828, in scheduler
    job.run()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line
197, in run
    self._execute()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line
1496, in _execute
    self._execute_helper(processor_manager)
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line
1541, in _execute_helper
    self.reset_state_for_orphaned_tasks(session=session)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py",
line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line
265, in reset_state_for_orphaned_tasks
    .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2679, in all
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line
2831, in __iter__
  File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line
2854, in _execute_and_instances
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line
945, in execute
  File "build/bdist.linux-x86_64/egg/sqlalchemy/sql/elements.py", line
263, in _execute_on_connection
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line
1053, in _execute_clauseelement
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line
1189, in _execute_context
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line
1393, in _handle_dbapi_exception
  File "build/bdist.linux-x86_64/egg/sqlalchemy/util/compat.py", line
203, in raise_from_cause
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line
1182, in _execute_context
  File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/default.py",
line 470, in do_execute
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py",
line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py",
line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError)
(1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL:
u'SELECT task_instance.task_id AS task_instance_task_id, ...
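For context, MySQL error 1205 just means the session waited longer than `innodb_lock_wait_timeout` (50 s by default) for a row lock held by another transaction, so it is transient by nature. A bounded retry can absorb occasional timeouts; the sketch below is a generic illustration only (the `OperationalError` class here is a stand-in, not Airflow's or SQLAlchemy's actual code):

```python
import time


class OperationalError(Exception):
    """Stand-in for the driver's OperationalError in this sketch."""


def run_with_retry(query_fn, retries=3, backoff=0.5):
    """Call query_fn, retrying when MySQL reports a 1205 lock-wait timeout.

    Any other error, or a 1205 on the final attempt, is re-raised.
    """
    for attempt in range(retries):
        try:
            return query_fn()
        except OperationalError as exc:
            if "1205" not in str(exc) or attempt == retries - 1:
                raise
            # Linear backoff before retrying the transaction.
            time.sleep(backoff * (attempt + 1))
```

This only papers over the symptom, of course; something still has to be holding the lock for 50 seconds in the first place.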

I'm not really sure what might be causing this or how to go about
debugging it.

I have been struggling through it by having a crontab do:

# reboot system a few times a day
45 1,13 * * * /sbin/shutdown -r +5

So I reboot the system a few times a day, and it typically comes back
to life and sorts itself out (if that's a technical term :) ).

I'm now going to reboot every 4 hours instead, as I've been caught out a
few times when it went into an error state a good while before the next
reboot from cron.

However, I know this is a really horrible workaround, and I really
should try to figure out the root cause and a proper solution.

So I'm wondering if you might have any ideas or suggestions as to what
might be going on?

I'm running Airflow 1.9.0alpha0 on a single Google Compute Engine
instance (Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-97-generic x86_64) - 24
vCPUs, 100 GB memory) with the LocalExecutor, as pretty much all our
DAGs are just sequences of BigQuery operators.

Also using a Google Cloud SQL (MySQL) database.

It looks like some sort of lock when trying to update the database, but
I'm not sure whether that's just a symptom or closer to the root cause.

Any help is much appreciated, as I'm a data scientist and not a data
engineer, so admittedly I'm a bit (quite) out of my depth here.

Cheers,

Andy

Re: Scheduler Keeps Crashing

Posted by Andrew Maguire <an...@gmail.com>.
Yeah, that makes sense - I think in my rush to upgrade to 1.9 I at some
stage ran "airflow upgradedb" and it just did nothing (and then I killed
it). I think I also ran it while the scheduler was still actively
reading and writing to the database. It was also a day after I upgraded
with pip that I realised I needed to run upgradedb, so all in all I
think I messed up the upgrade.

I have done an uninstall and reinstall of 1.8.2 and just pointed it at a
new database :) seems to be OK now.

I think I'll need to be a bit more careful when upgrading, especially
when it requires a database change.

We'll probably stay on 1.8.2 for now, as we're hoping to get a data
engineer in to help with all this.

Cheers,
Andy


On Wed, Oct 25, 2017 at 8:11 PM Joy Gao <jo...@wepay.com> wrote:

> [snip - quoted text appears in full in the replies below]

Re: Scheduler Keeps Crashing

Posted by Joy Gao <jo...@wepay.com>.
I've had something similar happen to me in the past (`airflow upgradedb`
stuck forever waiting for a lock on a model). Because some migration
scripts require querying the db, it got stuck waiting for the lock on a
table.

For me it had to do with calling a model's class method in a dag file,
which threw an error after an ORM query, and did not properly release the
lock on the table. As Bolke mentioned, I was able to track down the exact
locking table/query by using `show engine innodb status;`.
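As a rough illustration of what to look for in that output: the status dump contains one section per transaction, and lines flagged `LOCK WAIT` (or noting the transaction is waiting for a lock to be granted) identify the stuck ones. A small helper along these lines could pick them out (hypothetical helper; the sample text in the comment is invented for illustration):

```python
def find_waiting_transactions(innodb_status):
    """Pick out lines in SHOW ENGINE INNODB STATUS output that flag a
    transaction as waiting on a lock.

    The status text would normally be fetched with:
        cursor.execute("SHOW ENGINE INNODB STATUS")
    and the full dump read from the Status column of the result.
    """
    markers = ("LOCK WAIT", "WAITING FOR THIS LOCK TO BE GRANTED")
    return [line.strip()
            for line in innodb_status.splitlines()
            if any(m in line for m in markers)]
```

From there the transaction's MySQL thread id (and hence the offending connection) can be matched against `SHOW PROCESSLIST`.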

Hope it helps!

On Wed, Oct 25, 2017 at 4:41 AM, Bolke de Bruin <bd...@gmail.com> wrote:

> [snip - quoted text appears in full in the reply below]

Re: Scheduler Keeps Crashing

Posted by Bolke de Bruin <bd...@gmail.com>.
Hi Andrew,

Something is indeed keeping a lock on the database and not freeing it. If it crashes, can you find out what exactly is being locked by following the guidelines in this post: https://serverfault.com/questions/36260/how-can-i-show-mysql-locks

As locks typically are very short (ms) I am wondering if you are sure that you are running just *one* scheduler and not trying to start multiple at the same time? Another option is that one of the tasks is misbehaving. 

Having the process list and associated PIDs (and programs) is key in tracking this down.
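One quick way to rule out duplicate schedulers is to scan the process table for `airflow scheduler` entries; a helper like the sketch below could do it (the matching string is an assumption about how the scheduler command line appears in `ps` output):

```python
import subprocess


def scheduler_pids(ps_output=None):
    """Return PIDs of running `airflow scheduler` processes.

    ps_output may be supplied for testing; otherwise `ps -eo pid,args`
    is invoked. More than one PID here means more than one scheduler.
    """
    if ps_output is None:
        ps_output = subprocess.check_output(["ps", "-eo", "pid,args"]).decode()
    pids = []
    for line in ps_output.splitlines():
        # Each line is "<pid> <full command line>".
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and "airflow scheduler" in parts[1]:
            pids.append(int(parts[0]))
    return pids
```

(Note the scheduler also forks DAG-processing children, so interpret multiple matches with the parent/child relationship in mind.)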

Bolke


> On 25 Oct 2017, at 12:42, Andrew Maguire <an...@gmail.com> wrote:
>
> [snip - quoted text appears in full in the original message above]