Posted to dev@airflow.apache.org by va...@gmail.com on 2018/08/28 04:09:46 UTC

Getting Task Killed Externally

Hi Everyone,

For the last 2 weeks we've been facing an issue with a LocalExecutor setup of Airflow v1.9 (MySQL as metastore): in a DAG where retries are configured, if the initial try_number fails, then nearly 8 out of 10 times the task gets stuck in the up_for_retry state. In fact, no running state ever appears after Scheduled > Queued in the TI. The entry in the Job table is marked successful within a fraction of a second, a failed entry gets logged in the task_fail table without the task even reaching the operator code, and as a result we get an email alert saying

```
Try 2 out of 4
Exception:
Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally?
```

But when we changed the default value of job_heartbeat_sec from 5 to 30 seconds (https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0, mentioned by Max some time back in 2016 for healthy supervision), the issue stopped arising. We're still clueless as to how this new configuration actually solved/suppressed the issue; any key information around it would really help here.
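
For context, the only thing we changed is that one key in airflow.cfg; roughly the following (the comment is just what the docs say about the setting, we still don't understand the mechanism):

```
[scheduler]
# Default is 5. The docs describe this as how often task instances listen
# for an external kill signal; we only know empirically that raising it
# made the spurious "killed externally" retries stop.
job_heartbeat_sec = 30
```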

Regards,
Vardan Gupta 

Re: Getting Task Killed Externally

Posted by Trent Robbins <ro...@gmail.com>.
We saw the same thing. Only a few truly active tasks, yet the task queue
was filling up with pending tasks.

Best,
Trent

On Tue, Aug 28, 2018 at 12:47 AM Vardan Gupta <va...@gmail.com>
wrote:

> Hi Trent,
>
> Thanks for replying. Though you're suggesting there might be a case where
> we're hitting caps, on our side there are hardly any concurrent tasks,
> rarely 1-2 at a time, with parallelism set to 50. But yeah, we'll just
> increase the parallelism and see if that solves the problem too.
>
> Thanks,
> Vardan Gupta
>
> On Tue, Aug 28, 2018 at 11:17 AM Trent Robbins <ro...@gmail.com> wrote:
>
> > Hi Vardan,
> >
> > We had this issue - I recommend increasing the parallelism config
> > variable to something like 128 or 512. I have no idea what side effects
> > this could have. So far, none. This happened to us with LocalExecutor,
> > and our monitoring showed a clear issue with hitting a cap on the number
> > of concurrent tasks. I probably should have reported it, but we still
> > aren't sure what happened and haven't investigated why those tasks are
> > not getting kicked back into the queue.
> >
> > You may need to increase other config variables, too, if they also cause
> > you to hit caps. Some people are conservative about these variables. If
> > you are feeling conservative, you can get better telemetry into this with
> > Prometheus and Grafana. We followed this route but resolved to just set
> > the cap very high and deal with any side effects afterwards.
> >
> > Best,
> > Trent
> >
> >
> > On Mon, Aug 27, 2018 at 21:09 vardanguptacse@gmail.com <
> > vardanguptacse@gmail.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > For the last 2 weeks we've been facing an issue with a LocalExecutor
> > > setup of Airflow v1.9 (MySQL as metastore): in a DAG where retries are
> > > configured, if the initial try_number fails, then nearly 8 out of 10
> > > times the task gets stuck in the up_for_retry state. In fact, no
> > > running state ever appears after Scheduled > Queued in the TI. The
> > > entry in the Job table is marked successful within a fraction of a
> > > second, a failed entry gets logged in the task_fail table without the
> > > task even reaching the operator code, and as a result we get an email
> > > alert saying
> > >
> > > ```
> > > Try 2 out of 4
> > > Exception:
> > > Executor reports task instance %s finished (%s) although the task says
> > > its %s. Was the task killed externally?
> > > ```
> > >
> > > But when we changed the default value of job_heartbeat_sec from 5 to 30
> > > seconds (https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0,
> > > mentioned by Max some time back in 2016 for healthy supervision), the
> > > issue stopped arising. We're still clueless as to how this new
> > > configuration actually solved/suppressed the issue; any key information
> > > around it would really help here.
> > >
> > > Regards,
> > > Vardan Gupta
> > >
> > --
> > (Sent from cellphone)
>
-- 
(Sent from cellphone)

Re: Getting Task Killed Externally

Posted by Vardan Gupta <va...@gmail.com>.
Hi Trent,

Thanks for replying. Though you're suggesting there might be a case where
we're hitting caps, on our side there are hardly any concurrent tasks,
rarely 1-2 at a time, with parallelism set to 50. But yeah, we'll just
increase the parallelism and see if that solves the problem too.

Thanks,
Vardan Gupta

On Tue, Aug 28, 2018 at 11:17 AM Trent Robbins <ro...@gmail.com> wrote:

> Hi Vardan,
>
> We had this issue - I recommend increasing the parallelism config variable
> to something like 128 or 512. I have no idea what side effects this could
> have. So far, none. This happened to us with LocalExecutor, and our
> monitoring showed a clear issue with hitting a cap on the number of
> concurrent tasks. I probably should have reported it, but we still aren't
> sure what happened and haven't investigated why those tasks are not
> getting kicked back into the queue.
>
> You may need to increase other config variables, too, if they also cause
> you to hit caps. Some people are conservative about these variables. If you
> are feeling conservative, you can get better telemetry into this with
> Prometheus and Grafana. We followed this route but resolved to just set the
> cap very high and deal with any side effects afterwards.
>
> Best,
> Trent
>
>
> On Mon, Aug 27, 2018 at 21:09 vardanguptacse@gmail.com <
> vardanguptacse@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > For the last 2 weeks we've been facing an issue with a LocalExecutor
> > setup of Airflow v1.9 (MySQL as metastore): in a DAG where retries are
> > configured, if the initial try_number fails, then nearly 8 out of 10
> > times the task gets stuck in the up_for_retry state. In fact, no running
> > state ever appears after Scheduled > Queued in the TI. The entry in the
> > Job table is marked successful within a fraction of a second, a failed
> > entry gets logged in the task_fail table without the task even reaching
> > the operator code, and as a result we get an email alert saying
> >
> > ```
> > Try 2 out of 4
> > Exception:
> > Executor reports task instance %s finished (%s) although the task says
> > its %s. Was the task killed externally?
> > ```
> >
> > But when we changed the default value of job_heartbeat_sec from 5 to 30
> > seconds (https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0,
> > mentioned by Max some time back in 2016 for healthy supervision), the
> > issue stopped arising. We're still clueless as to how this new
> > configuration actually solved/suppressed the issue; any key information
> > around it would really help here.
> >
> > Regards,
> > Vardan Gupta
> >
> --
> (Sent from cellphone)

Re: Getting Task Killed Externally

Posted by Trent Robbins <ro...@gmail.com>.
Hi Vardan,

We had this issue - I recommend increasing the parallelism config variable
to something like 128 or 512. I have no idea what side effects this could
have. So far, none. This happened to us with LocalExecutor, and our
monitoring showed a clear issue with hitting a cap on the number of
concurrent tasks. I probably should have reported it, but we still aren't
sure what happened and haven't investigated why those tasks are not getting
kicked back into the queue.
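
Concretely, the only knob we touched is under [core]; something like the
following (the exact number was arbitrary on our end):

```
[core]
# Max number of task instances allowed to run simultaneously across the
# whole installation. The shipped default is 32; we set it well above our
# observed peak so the executor never queues work against this limit.
parallelism = 512
```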

You may need to increase other config variables, too, if they also cause
you to hit caps. Some people are conservative about these variables. If you
are feeling conservative, you can get better telemetry into this with
Prometheus and Grafana. We followed this route but resolved to just set the
cap very high and deal with any side effects afterwards.
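
If you do go the telemetry route first, one way to get Airflow metrics into
Prometheus/Grafana is the built-in StatsD support plus a statsd exporter;
I can't promise this is exactly what you want, but the relevant airflow.cfg
settings look roughly like this:

```
[scheduler]
# Turn on Airflow's built-in StatsD metrics and point them at a local
# StatsD endpoint (or a statsd_exporter that Prometheus can scrape).
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```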

Best,
Trent


On Mon, Aug 27, 2018 at 21:09 vardanguptacse@gmail.com <
vardanguptacse@gmail.com> wrote:

> Hi Everyone,
>
> For the last 2 weeks we've been facing an issue with a LocalExecutor setup
> of Airflow v1.9 (MySQL as metastore): in a DAG where retries are
> configured, if the initial try_number fails, then nearly 8 out of 10 times
> the task gets stuck in the up_for_retry state. In fact, no running state
> ever appears after Scheduled > Queued in the TI. The entry in the Job
> table is marked successful within a fraction of a second, a failed entry
> gets logged in the task_fail table without the task even reaching the
> operator code, and as a result we get an email alert saying
>
> ```
> Try 2 out of 4
> Exception:
> Executor reports task instance %s finished (%s) although the task says its
> %s. Was the task killed externally?
> ```
>
> But when we changed the default value of job_heartbeat_sec from 5 to 30
> seconds (https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0,
> mentioned by Max some time back in 2016 for healthy supervision), the
> issue stopped arising. We're still clueless as to how this new
> configuration actually solved/suppressed the issue; any key information
> around it would really help here.
>
> Regards,
> Vardan Gupta
>
-- 
(Sent from cellphone)