Posted to dev@airflow.apache.org by הילה ויזן <hi...@gmail.com> on 2016/10/27 05:13:08 UTC

Re: Scheduler getting stuck - request for details

Hi Jeff & Bolke,

Did you find anything related to this issue?
As a workaround we wrote a script that restarts the scheduler every 30
minutes, but we don't want to keep relying on this patch.
We really want to understand the root cause of the problem, and to know
whether there will be an official fix.
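
In case it helps, the restart script is essentially the following (a
minimal sketch assuming Python 3; the real script also logs and alerts):

    import subprocess

    RESTART_INTERVAL = 30 * 60  # restart the scheduler every 30 minutes

    while True:
        proc = subprocess.Popen(["airflow", "scheduler"])
        try:
            # If the scheduler exits on its own, the loop restarts it.
            proc.wait(timeout=RESTART_INTERVAL)
        except subprocess.TimeoutExpired:
            # Still running after 30 minutes: restart it preemptively.
            proc.terminate()
            proc.wait()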


On Thu, Sep 8, 2016 at 12:28 AM, Bolke de Bruin <bd...@gmail.com> wrote:

> Furthermore, have you tried it with RabbitMQ as well, since Redis does not
> guarantee delivery (nor does Celery)?
> Is your Redis setup located on the same machine as the scheduler, or
> separate?
>
> Thanks!
> Bolke
>
> > On 7 Sep 2016, at 23:17, Bolke de Bruin <bd...@gmail.com> wrote:
> >
> >
> >> On 7 Sep 2016, at 21:37, Jeff Balogh <jb...@stripe.com.INVALID> wrote:
> >>
> >> On Wed, Sep 7, 2016 at 12:17 PM, Bolke de Bruin <bd...@gmail.com> wrote:
> >>> Ah, this is the more interesting case. Are you getting tasks into
> >>> SCHEDULED and then the scheduler itself gets stuck? Or do the workers
> >>> not execute anything anymore?
> >>
> >> The tasks are put into the SCHEDULED state but they don't make it to a
> >> worker. This isn't deterministic. With our patch to clean up orphans,
> >> a task may flap in SCHEDULED a few times but eventually it makes it to
> >> a worker.
> >
> > Ok. So are you implying that the executor is not picking up the tasks,
> > or that the queue is losing tasks? Are you able to find out what Redis
> > is doing when a ’scheduled’ task is flapping, i.e. does it receive the
> > task at all? Btw, how did this behave before the SCHEDULED state was
> > introduced?
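> >
> > For example, something along these lines would show whether messages
> > are sitting in the queue (a sketch, assuming a local Redis broker and
> > Airflow's default Celery queue name "default"):
> >
> >     import redis
> >
> >     # Celery's Redis transport keeps pending messages in a list keyed
> >     # by queue name.
> >     r = redis.StrictRedis(host="localhost", port=6379)  # assumption: broker on localhost
> >     print(r.llen("default"))  # number of undelivered messages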
> >
> >>
> >> The scheduler and workers are otherwise running fine. We've been
> >> running with the same celery/redis setup for a year.
> >>
> >>> How do you run your scheduler? With num_runs?
> >>
> >> We don't use num_runs. We restart the scheduler when we deploy new code.
> >>
> >>> A later patch checks for these “orphaned_tasks” at scheduler startup.
> >>
> >> We check for the orphans at the top of the scheduler loop, so on every
> >> run.
> >
> > Ok, we moved away from this for performance reasons. Depending on the
> > solution for the above issue, we might need to apply it to every run
> > then.
> >
> >>
> >>> In other words, can you provide some more information? :-)
> >>>
> >>> Bolke
> >>>
> >>>> On 7 Sep 2016, at 20:08, Jeff Balogh <jb...@stripe.com.INVALID> wrote:
> >>>>
> >>>> Ah yep, we're on https://github.com/apache/incubator-airflow/commits/54b361d2a.
> >>>>
> >>>>> On Wed, Sep 7, 2016 at 10:13 AM, Bolke de Bruin <bd...@gmail.com> wrote:
> >>>>> Hi Jeff,
> >>>>>
> >>>>> That is kind of impossible for 1.7.1.3, as the SCHEDULED state was
> >>>>> introduced after that release. Are you sure you are on 1.7.1.3 and
> >>>>> not on master?
> >>>>>
> >>>>> Bolke
> >>>>>
> >>>>>> On 7 Sep 2016, at 18:37, Jeff Balogh <jb...@stripe.com.INVALID> wrote:
> >>>>>>
> >>>>>> When we bumped to 1.7.1.3 we found that tasks would go into the new
> >>>>>> SCHEDULED state and get stuck there. We haven't determined why this
> >>>>>> happens.
> >>>>>>
> >>>>>> We put a hacky patch into our scheduler that sets state to None for
> >>>>>> any tasks that are SCHEDULED at the beginning of the scheduler loop.
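> >>>>>>
> >>>>>> Roughly, the patch amounts to the following (a sketch, not our
> >>>>>> exact code; import paths may differ between Airflow versions):
> >>>>>>
> >>>>>>     from airflow import settings
> >>>>>>     from airflow.models import TaskInstance
> >>>>>>     from airflow.utils.state import State
> >>>>>>
> >>>>>>     # At the top of the scheduler loop: reset stuck tasks so the
> >>>>>>     # scheduler picks them up again.
> >>>>>>     session = settings.Session()
> >>>>>>     session.query(TaskInstance) \
> >>>>>>         .filter(TaskInstance.state == State.SCHEDULED) \
> >>>>>>         .update({TaskInstance.state: None}, synchronize_session=False)
> >>>>>>     session.commit()
> >>>>>>     session.close()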
> >>>>>>
> >>>>>> Name: airflow
> >>>>>> Version: 1.7.1.3
> >>>>>> Name: celery
> >>>>>> Version: 3.1.23
> >>>>>> Name: kombu
> >>>>>> Version: 3.0.35
> >>>>>>
> >>>>>> redis_version:2.6.13
> >>>>>>
> >>>>>> On Sun, Sep 4, 2016 at 6:34 AM, Bolke de Bruin <bd...@gmail.com> wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> We have had some reports on this list, and sometimes on Jira, that
> >>>>>>> the scheduler seems to get stuck. I would like to track down this
> >>>>>>> issue, but until now much of the reporting has been a bit light on
> >>>>>>> the details.
> >>>>>>>
> >>>>>>> First and foremost, I am assuming that getting “stuck” only
> >>>>>>> happens when using a CeleryExecutor. To further track down the
> >>>>>>> issue I would like to know the following:
> >>>>>>>
> >>>>>>> - Airflow version (pip show airflow)
> >>>>>>> - Celery version (pip show celery)
> >>>>>>> - Kombu version (pip show kombu)
> >>>>>>>
> >>>>>>> - Redis version (if applicable)
> >>>>>>> - RabbitMQ version (if applicable)
> >>>>>>>
> >>>>>>> - Sanitized airflow configuration
> >>>>>>> - Sanitized broker configuration
> >>>>>>>
> >>>>>>> If possible, supply logs (preferably at debug level) from the
> >>>>>>> broker, scheduler, and worker.
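> >>>>>>>
> >>>>>>> If it is easier, a snippet along these lines prints the Python
> >>>>>>> package versions in one go (a sketch using pkg_resources; the pip
> >>>>>>> package name in this era is "airflow"):
> >>>>>>>
> >>>>>>>     import pkg_resources
> >>>>>>>
> >>>>>>>     for pkg in ("airflow", "celery", "kombu", "redis"):
> >>>>>>>         print(pkg, pkg_resources.get_distribution(pkg).version)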
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> Bolke
> >>>>>>>
> >>>>>
> >>>
> >
>
>