Posted to dev@airflow.apache.org by Shubham Gupta <sh...@gmail.com> on 2018/07/20 07:40:13 UTC

Failover in apache 1.8.0

Hi,

I would like to know what happens if a Celery worker running one of the
tasks crashes. Will the job be rescheduled?

Also, if the scheduler is not able to schedule a task on time due to heavy
load on all workers, what will happen to the task?

Regards
Shubham Gupta

Re: Failover in apache 1.8.0

Posted by Ruiqin Yang <yr...@gmail.com>.

"Scheduler lost track of it" refers to cases such as the scheduler process
being killed. When the scheduler restarts, tasks in the SCHEDULED or QUEUED
state are reset to the NONE state.
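
A minimal sketch of that reset, as a toy model rather than Airflow's actual
code. The TaskInstance class and reset_orphaned_tasks function here are
hypothetical names used only for illustration; None stands in for the NONE
state.

```python
# Toy model of the orphan reset on scheduler restart; not Airflow's real code.
from dataclasses import dataclass
from typing import List, Optional

SCHEDULED = "scheduled"
QUEUED = "queued"
RUNNING = "running"


@dataclass
class TaskInstance:
    task_id: str
    state: Optional[str]  # None models the NONE state


def reset_orphaned_tasks(task_instances: List[TaskInstance]) -> List[str]:
    """Reset SCHEDULED/QUEUED task instances to None and return their ids."""
    reset_ids = []
    for ti in task_instances:
        if ti.state in (SCHEDULED, QUEUED):
            ti.state = None  # the restarted scheduler will consider it again
            reset_ids.append(ti.task_id)
    return reset_ids


tis = [
    TaskInstance("extract", QUEUED),
    TaskInstance("transform", SCHEDULED),
    TaskInstance("load", RUNNING),
]
reset = reset_orphaned_tasks(tis)
```

In this model only the queued and scheduled instances are reset; the running
one is left alone.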

As for the SLA, I think that delay is included; here is the logic Airflow
uses to calculate SLA misses
<https://github.com/apache/incubator-airflow/blob/284dbdb60ab1fec027dea4871e3013a4727f6041/airflow/jobs.py#L604-L739>.
I think the SLA in Airflow serves a similar purpose (e.g. you can add
sla_miss_callback to your DAG); here's the doc
<https://airflow.apache.org/concepts.html?highlight=slas#slas> for it.
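
To illustrate why the queueing delay would count, here is a simplified
sketch, assuming the SLA clock starts at the scheduled time rather than when
a worker picks the task up. The sla_missed function and its parameters are
hypothetical and much simpler than the linked logic.

```python
# Simplified SLA check; an illustration, not the logic in airflow/jobs.py.
from datetime import datetime, timedelta


def sla_missed(scheduled_start: datetime, sla: timedelta,
               now: datetime, succeeded: bool) -> bool:
    """Evaluated at `now`: True if the task has not succeeded and the
    deadline (scheduled_start + sla) has already passed."""
    return (not succeeded) and now > scheduled_start + sla


scheduled_start = datetime(2018, 7, 20, 8, 0)
sla = timedelta(minutes=30)
check_time = datetime(2018, 7, 20, 8, 40)

# The task is still waiting in the queue 40 minutes after its scheduled time.
still_queued = sla_missed(scheduled_start, sla, check_time, succeeded=False)
# The task had already succeeded when the check ran.
already_done = sla_missed(scheduled_start, sla, check_time, succeeded=True)
```

Under that assumption, a task that never left the queue still counts as an
SLA miss, because the deadline does not depend on when the worker started it.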

Cheers,
Kevin Y

On Fri, Jul 20, 2018 at 1:49 PM Shubham Gupta <sh...@gmail.com>
wrote:

> Also, is the delay between adding a task to the queue and the task starting
> on the worker not included in the task's SLA? Or does the SLA period begin
> once the task actually starts on the worker? Also, if the scheduler has to
> wait for a response from the worker for the final state of the task
> (success/failure), how can the scheduler lose track of the task?
>
> FYI, I am comparing Airflow with Quartz, which has misfire handling built
> in. A misfire in Quartz means that the task was not started within a
> pre-configured interval beginning from its scheduled start time. Isn't
> there something similar in Airflow?
>
> Regards
> Shubham Gupta

Re: Failover in apache 1.8.0

Posted by Shubham Gupta <sh...@gmail.com>.

Also, is the delay between adding a task to the queue and the task starting
on the worker not included in the task's SLA? Or does the SLA period begin
once the task actually starts on the worker? Also, if the scheduler has to
wait for a response from the worker for the final state of the task
(success/failure), how can the scheduler lose track of the task?

FYI, I am comparing Airflow with Quartz, which has misfire handling built
in. A misfire in Quartz means that the task was not started within a
pre-configured interval beginning from its scheduled start time. Isn't
there something similar in Airflow?

Regards
Shubham Gupta

On Fri, Jul 20, 2018 at 1:42 PM Shubham Gupta <sh...@gmail.com>
wrote:

> Hi Ruiqin Yang,
>
> Can you please elaborate on what is meant by "and the scheduler lost
> track of it" in your second paragraph? When can this happen? Also, what
> is the default state when the scheduler restarts? Is it not None?
>
> Thanks for your quick reply.
>
> Regards
> Shubham Gupta

Re: Failover in apache 1.8.0

Posted by Shubham Gupta <sh...@gmail.com>.

Hi Ruiqin Yang,

Can you please elaborate on what is meant by "and the scheduler lost track
of it" in your second paragraph? When can this happen? Also, what is the
default state when the scheduler restarts? Is it not None?

Thanks for your quick reply.

Regards
Shubham Gupta


On Fri, Jul 20, 2018 at 1:04 AM Ruiqin Yang <yr...@gmail.com> wrote:

> Hi Shubham,
>
> A worker running an Airflow task heartbeats regularly, which updates the
> task instance entry in the DB. The scheduler kills task instances that have
> not sent a heartbeat for a long time (these are called zombie tasks), and if
> the task has retries left it will try to reschedule it (given that all
> trigger rules are satisfied).
>
> If the workers are under heavy load, the scheduler can still schedule tasks
> (i.e. put them into the worker queue); you then just wait for the workers
> to pick the tasks up from the queue. If the tasks never get picked up and
> the scheduler loses track of them, their state is reset to NONE when the
> scheduler restarts; these are called orphan tasks.
>
> FYI, inside Airbnb, Alex Guziel (@saguziel <https://github.com/saguziel>)
> has a patch that requeues tasks if they are not picked up by workers for a
> long time, and he plans to open-source it.
>
> Cheers,
> Kevin Y

Re: Failover in apache 1.8.0

Posted by Ruiqin Yang <yr...@gmail.com>.

Hi Shubham,

A worker running an Airflow task heartbeats regularly, which updates the
task instance entry in the DB. The scheduler kills task instances that have
not sent a heartbeat for a long time (these are called zombie tasks), and if
the task has retries left it will try to reschedule it (given that all
trigger rules are satisfied).
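
The heartbeat check can be sketched roughly as follows. This is a toy model,
not the scheduler's implementation: the RunningTask class, the handle_zombie
function and the 5-minute threshold are assumptions made for illustration,
and trigger rules are left out.

```python
# Toy model of zombie detection by stale heartbeat; not Airflow's real code.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RunningTask:
    task_id: str
    latest_heartbeat: datetime
    try_number: int  # attempts made so far, including the current one
    retries: int     # retries allowed after the first attempt


def handle_zombie(task: RunningTask, now: datetime,
                  threshold: timedelta) -> str:
    """Return the state the task ends up in after the check at `now`."""
    if now - task.latest_heartbeat <= threshold:
        return "running"  # heartbeat is recent, so the task is not a zombie
    # Heartbeat is stale: treat the task as a zombie and fail this attempt.
    if task.try_number <= task.retries:
        return "up_for_retry"  # a retry is left, so it can be rescheduled
    return "failed"  # no retries left


now = datetime(2018, 7, 20, 12, 0)
threshold = timedelta(minutes=5)  # hypothetical value

healthy = RunningTask("a", now - timedelta(minutes=1), try_number=1, retries=1)
crashed = RunningTask("b", now - timedelta(minutes=10), try_number=1, retries=1)
exhausted = RunningTask("c", now - timedelta(minutes=10), try_number=2, retries=1)

states = [handle_zombie(t, now, threshold) for t in (healthy, crashed, exhausted)]
```

So whether a task on a crashed worker runs again depends on the retries
configured for it.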

If the workers are under heavy load, the scheduler can still schedule tasks
(i.e. put them into the worker queue); you then just wait for the workers
to pick the tasks up from the queue. If the tasks never get picked up and
the scheduler loses track of them, their state is reset to NONE when the
scheduler restarts; these are called orphan tasks.

FYI, inside Airbnb, Alex Guziel (@saguziel <https://github.com/saguziel>)
has a patch that requeues tasks if they are not picked up by workers for a
long time, and he plans to open-source it.

Cheers,
Kevin Y

On Fri, Jul 20, 2018 at 12:40 AM Shubham Gupta <sh...@gmail.com>
wrote:

> Hi,
>
> I would like to know what happens if a Celery worker running one of the
> tasks crashes. Will the job be rescheduled?
>
> Also, if the scheduler is not able to schedule a task on time due to heavy
> load on all workers, what will happen to the task?
>
> Regards
> Shubham Gupta
>