Posted to dev@airflow.apache.org by ra...@gmail.com, ra...@gmail.com on 2018/08/16 08:25:37 UTC

Task is Stuck in Up_For_Retry

Hi All,

We are using Airflow 1.9 in Local Executor mode. Intermittently we are observing that tasks get stuck in the "up_for_retry" state and are retried again and again, exceeding their configured max retries count. For example, we have configured max retries as 2, but a task was retried 15 times and got stuck in the up_for_retry state.
Any pointer on this would be helpful.
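
For reference, the retry settings in question are configured on the task roughly like this (a minimal sketch with hypothetical DAG and task names, not our actual DAG):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2018, 8, 1),
        "retries": 2,                         # configured max retries
        "retry_delay": timedelta(minutes=5),  # wait between retries
    }

    dag = DAG("example_retry_dag", default_args=default_args,
              schedule_interval="@daily")

    task = BashOperator(
        task_id="example_task",
        bash_command="exit 1",  # always fails, so the retry path is exercised
        dag=dag,
    )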

Thanks,
Raman Gupta

Re: Task is Stuck in Up_For_Retry

Posted by raman gupta <ra...@gmail.com>.
Hi All,

On further investigation we found that this issue is reproduced if the task
retry_delay is set to a very low value, say 10 seconds.
All the retries of a task are given the same key by the executor (LocalExecutor),
which is composed of dag_id, task_id and execution_date. So we are hitting
the scenario where the scheduler schedules the next task run while the
executor's event for the previous run has not yet been processed. So in jobs.py's
process_executor_events function, the scheduler associates the last run's
executor event with the current run (which might be in the queued state)
and marks it as failed/up_for_retry.
Due to this, the task state follows the
up_for_retry->scheduled->queued->failed/up_for_retry transition instead of
the expected up_for_retry->scheduled->queued->running->failed/up_for_retry.
And we get to see logs like "Task is not able to be run" and
"executor reports the task as finished but it is still in queued state".
We have raised a JIRA for this,
https://issues.apache.org/jira/browse/AIRFLOW-3136, and are also looking
into a potential fix.
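
To illustrate the collision (a simplified sketch, not the actual scheduler
code; the dag/task names are made up):

    # In Airflow 1.9 the executor keys a task instance by
    # (dag_id, task_id, execution_date); the try number is not part of the key.
    key_try_1 = ("my_dag", "my_task", "2018-08-16T00:00:00")
    key_try_2 = ("my_dag", "my_task", "2018-08-16T00:00:00")
    assert key_try_1 == key_try_2  # both tries land on the same event-buffer slot

    # So if the event for try 1 is still sitting in the executor's event buffer
    # when try 2 has already been queued, process_executor_events resolves the
    # event to the new try, sees "finished" while the task instance is still
    # QUEUED, and calls handle_failure on it.

A larger retry_delay simply gives the executor time to drain the previous
event before the next try is queued, which is presumably why a 10-second
retry_delay is enough to hit the race.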

Thanks,
Raman Gupta

On Fri, Aug 24, 2018 at 3:56 PM ramandumcs@gmail.com <ra...@gmail.com>
wrote:

> Hi All,
> Any pointer on this would be helpful. We have added extra logs and are
> trying few thing to get the root cause. But we are getting logs like "Task
> is not able to run".
> And we are not getting any resource usage related error.
>
> Thanks,
> Raman Gupta
>
> On 2018/08/21 16:46:56, ramandumcs@gmail.com <ra...@gmail.com>
> wrote:
> > Hi All,
> > As per http://docs.sqlalchemy.org/en/latest/core/connections.html link
> db engine is not portable across process boundaries
> > "For a multiple-process application that uses the os.fork system call,
> or for example the Python multiprocessing module, it’s usually required
> that a separate Engine be used for each child process. This is because the
> Engine maintains a reference to a connection pool that ultimately
> references DBAPI connections - these tend to not be portable across process
> boundaries"
> > Please correct me if I am wrong but It seems that in Airflow 1.9 child
> processes don't create separate DB engine and so there is only one single
> DB Engine which is shared among child processes which might be causing this
> issue.
> >
> > Thanks,
> > Raman Gupta
> >
> >
> > On 2018/08/21 15:41:14, raman gupta <ra...@gmail.com> wrote:
> > > One possibility is the unavailability of session while calling
> > > self.task_instance._check_and_change_state_before_execution
> > > function.
> > > (Session is provided via @provide_session decorator)
> > >
> > > On Tue, Aug 21, 2018 at 7:09 PM vardanguptacse@gmail.com <
> > > vardanguptacse@gmail.com> wrote:
> > >
> > > > Is there any possibility that on call of function
> > > > _check_and_change_state_before_execution at
> > > >
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500
> ,
> > > > this method is not actually being called
> > > >
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299
> ?
> > > > Because even in a happy scenario, no logs is printed from method's
> > > > implementation and directly control is reaching here
> > > >
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512
> > > > while in stuck phase, we are seeing this log
> > > >
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508
> > > > i.e. Task is not able to be run, FYI we've not set any sort of
> dependency
> > > > with dag.
> > > >
> > > > Regards,
> > > > Vardan Gupta
> > > >
> > > > On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com>
> > > > wrote:
> > > > > Hi All,
> > > > >
> > > > > We are using airflow 1.9 with Local Executor more. Intermittently
> we are
> > > > observing that tasks are getting stuck in "up_for_retry" mode and are
> > > > getting retried again and again exceeding their configured max
> retries
> > > > count. like we have configured max retries as 2 but task is retried
> 15
> > > > times and got stuck in up_for_retry state.
> > > > > Any pointer on this would be helpful.
> > > > >
> > > > > Thanks,
> > > > > Raman Gupta
> > > > >
> > > >
> > >
> >
>

Re: Task is Stuck in Up_For_Retry

Posted by ra...@gmail.com, ra...@gmail.com.
Hi All,
Any pointer on this would be helpful. We have added extra logs and are trying a few things to get to the root cause. But we are getting logs like "Task is not able to be run".
And we are not getting any resource-usage-related error.

Thanks,
Raman Gupta  

On 2018/08/21 16:46:56, ramandumcs@gmail.com <ra...@gmail.com> wrote: 
> Hi All,
> As per http://docs.sqlalchemy.org/en/latest/core/connections.html link db engine is not portable across process boundaries
> "For a multiple-process application that uses the os.fork system call, or for example the Python multiprocessing module, it’s usually required that a separate Engine be used for each child process. This is because the Engine maintains a reference to a connection pool that ultimately references DBAPI connections - these tend to not be portable across process boundaries"
> Please correct me if I am wrong but It seems that in Airflow 1.9 child processes don't create separate DB engine and so there is only one single DB Engine which is shared among child processes which might be causing this issue.
> 
> Thanks,
> Raman Gupta 
> 
> 
> On 2018/08/21 15:41:14, raman gupta <ra...@gmail.com> wrote: 
> > One possibility is the unavailability of session while calling
> > self.task_instance._check_and_change_state_before_execution
> > function.
> > (Session is provided via @provide_session decorator)
> > 
> > On Tue, Aug 21, 2018 at 7:09 PM vardanguptacse@gmail.com <
> > vardanguptacse@gmail.com> wrote:
> > 
> > > Is there any possibility that on call of function
> > > _check_and_change_state_before_execution at
> > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500,
> > > this method is not actually being called
> > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299?
> > > Because even in a happy scenario, no logs is printed from method's
> > > implementation and directly control is reaching here
> > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512
> > > while in stuck phase, we are seeing this log
> > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508
> > > i.e. Task is not able to be run, FYI we've not set any sort of dependency
> > > with dag.
> > >
> > > Regards,
> > > Vardan Gupta
> > >
> > > On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com>
> > > wrote:
> > > > Hi All,
> > > >
> > > > We are using airflow 1.9 with Local Executor more. Intermittently we are
> > > observing that tasks are getting stuck in "up_for_retry" mode and are
> > > getting retried again and again exceeding their configured max retries
> > > count. like we have configured max retries as 2 but task is retried 15
> > > times and got stuck in up_for_retry state.
> > > > Any pointer on this would be helpful.
> > > >
> > > > Thanks,
> > > > Raman Gupta
> > > >
> > >
> > 
> 

Re: Task is Stuck in Up_For_Retry

Posted by ra...@gmail.com, ra...@gmail.com.
Hi All,
As per the http://docs.sqlalchemy.org/en/latest/core/connections.html link, the DB engine is not portable across process boundaries:
"For a multiple-process application that uses the os.fork system call, or for example the Python multiprocessing module, it’s usually required that a separate Engine be used for each child process. This is because the Engine maintains a reference to a connection pool that ultimately references DBAPI connections - these tend to not be portable across process boundaries"
Please correct me if I am wrong, but it seems that in Airflow 1.9 child processes don't create a separate DB engine, so a single DB engine is shared among the child processes, which might be causing this issue.
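
The pattern the SQLAlchemy docs recommend looks roughly like this (a minimal
sketch with a made-up connection string; Airflow actually builds its engine
from sql_alchemy_conn in airflow.cfg):

    import os

    from sqlalchemy import create_engine

    engine = create_engine("mysql://user:password@localhost/airflow")

    pid = os.fork()
    if pid == 0:
        # Child process: the pooled DBAPI connections inherited from the parent
        # are not safe to reuse across the fork. Dispose of the inherited pool
        # so the child lazily opens its own fresh connections.
        engine.dispose()
        # ... child does its DB work here using `engine` ...
        os._exit(0)

If a forked child instead keeps using the parent's pooled connections, the two
processes can end up talking over the same socket, leading to errors or
confused sessions, which could be related to the state update never landing.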

Thanks,
Raman Gupta 


On 2018/08/21 15:41:14, raman gupta <ra...@gmail.com> wrote: 
> One possibility is the unavailability of session while calling
> self.task_instance._check_and_change_state_before_execution
> function.
> (Session is provided via @provide_session decorator)
> 
> On Tue, Aug 21, 2018 at 7:09 PM vardanguptacse@gmail.com <
> vardanguptacse@gmail.com> wrote:
> 
> > Is there any possibility that on call of function
> > _check_and_change_state_before_execution at
> > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500,
> > this method is not actually being called
> > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299?
> > Because even in a happy scenario, no logs is printed from method's
> > implementation and directly control is reaching here
> > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512
> > while in stuck phase, we are seeing this log
> > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508
> > i.e. Task is not able to be run, FYI we've not set any sort of dependency
> > with dag.
> >
> > Regards,
> > Vardan Gupta
> >
> > On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com>
> > wrote:
> > > Hi All,
> > >
> > > We are using airflow 1.9 with Local Executor more. Intermittently we are
> > observing that tasks are getting stuck in "up_for_retry" mode and are
> > getting retried again and again exceeding their configured max retries
> > count. like we have configured max retries as 2 but task is retried 15
> > times and got stuck in up_for_retry state.
> > > Any pointer on this would be helpful.
> > >
> > > Thanks,
> > > Raman Gupta
> > >
> >
> 

Re: Task is Stuck in Up_For_Retry

Posted by raman gupta <ra...@gmail.com>.
One possibility is the unavailability of a session while calling the
self.task_instance._check_and_change_state_before_execution
function.
(A session is provided via the @provide_session decorator.)
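
For anyone following along, the decorator's idea is roughly this (a
paraphrased sketch, not the actual airflow.utils.db source):

    from functools import wraps

    from sqlalchemy.orm import sessionmaker

    Session = sessionmaker()  # in Airflow this is bound to the configured engine

    def provide_session(func):
        """Inject a SQLAlchemy session if the caller did not pass one."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            if kwargs.get("session") is not None:
                # Caller supplied a session; use it and let the caller manage it.
                return func(*args, **kwargs)
            session = Session()
            try:
                result = func(*args, session=session, **kwargs)
                session.commit()
                return result
            finally:
                session.close()
        return wrapper

So a call without an explicit session should normally just get a fresh one
created by the decorator itself.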

On Tue, Aug 21, 2018 at 7:09 PM vardanguptacse@gmail.com <
vardanguptacse@gmail.com> wrote:

> Is there any possibility that on call of function
> _check_and_change_state_before_execution at
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500,
> this method is not actually being called
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299?
> Because even in a happy scenario, no logs is printed from method's
> implementation and directly control is reaching here
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512
> while in stuck phase, we are seeing this log
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508
> i.e. Task is not able to be run, FYI we've not set any sort of dependency
> with dag.
>
> Regards,
> Vardan Gupta
>
> On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com>
> wrote:
> > Hi All,
> >
> > We are using airflow 1.9 with Local Executor more. Intermittently we are
> observing that tasks are getting stuck in "up_for_retry" mode and are
> getting retried again and again exceeding their configured max retries
> count. like we have configured max retries as 2 but task is retried 15
> times and got stuck in up_for_retry state.
> > Any pointer on this would be helpful.
> >
> > Thanks,
> > Raman Gupta
> >
>

Re: Task is Stuck in Up_For_Retry

Posted by va...@gmail.com, va...@gmail.com.
Is there any possibility that, on the call to the function _check_and_change_state_before_execution at https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500, this method https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299 is not actually being invoked? Even in a happy scenario, no logs are printed from the method's implementation and control reaches https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512 directly, while in the stuck phase we are seeing this log https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508, i.e. "Task is not able to be run". FYI, we've not set any sort of dependency within the DAG.

Regards,
Vardan Gupta

On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com> wrote: 
> Hi All,
> 
> We are using airflow 1.9 with Local Executor more. Intermittently we are observing that tasks are getting stuck in "up_for_retry" mode and are getting retried again and again exceeding their configured max retries count. like we have configured max retries as 2 but task is retried 15 times and got stuck in up_for_retry state.
> Any pointer on this would be helpful.
> 
> Thanks,
> Raman Gupta
> 

Re: Task is Stuck in Up_For_Retry

Posted by ra...@gmail.com, ra...@gmail.com.
We are getting logs like the following:

{local_executor.py:43} INFO - LocalWorker running airflow run <TaskCommand>
 {models.py:1595} ERROR - Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally?
{models.py:1616} INFO - Marking task as UP_FOR_RETRY

It seems that in the local_executor.py file the subprocess.check_call command returns successfully without throwing an error, but the task state does not get updated from queued to running in the MySQL store.
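
For context, the worker-side call being discussed is roughly the following (a
paraphrase of the 1.9 LocalExecutor logic, not the exact source; it shows why
a clean check_call return says nothing about the task ever reaching the
running state):

    import subprocess

    def execute_work(result_queue, key, command):
        # Run the "airflow run ..." command in a shell. check_call only reports
        # whether that child process exited with status 0; the task instance's
        # state in the metadata DB is updated by the child itself, not here.
        try:
            subprocess.check_call(command, shell=True)
            state = "success"
        except subprocess.CalledProcessError:
            state = "failed"
        # This (key, state) event is what the scheduler later consumes in
        # process_executor_events.
        result_queue.put((key, state))
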
One possibility is that the "_check_and_change_state_before_execution" function in models.py is returning False, due to which the task state is not getting updated to the "running" state and the task gets stuck in up_for_retry via the handle_failure function.

Another possibility could be the unavailability of system resources to create a new process inside the subprocess.check_call call, but I think in that case it would throw an error.

Is there a way to enable the "verbose" logging flag in the _check_and_change_state_before_execution call? It would help us in debugging this further.

Thanks,
Raman Gupta


On 2018/08/17 15:22:27, Matthias Huschle <ma...@paymill.de> wrote: 
> Hi Raman,
> 
> Does it happen only occasionally, or can it be easily reproduced?
> What happens if you start it with "airflow run" or " airflow test"?
> What is in the logs about it?
> What is your user process limit ("ulimit -u") on that machine?
> 
> 
> 2018-08-17 15:39 GMT+02:00 ramandumcs@gmail.com <ra...@gmail.com>:
> 
> > Thanks Taylor,
> > We are getting this issue even after restart. We are observing that task
> > instance state is transitioned from
> > scheduled->queued->up_for_retry and dag gets stuck in up_for_retry state.
> > Behind the scenes executor keep on retrying the dag's task exceeding the
> > max retry limit.
> > In normal scenario task state should have following transition
> > scheduled->queued->running->up_for_retry
> > but we are seeing task is not entered in to running state rather it moves
> > directly to up_for_retry from queued state.
> > Any pointer on this would be helpful.
> >
> > Thanks,
> > raman Gupta
> >
> >
> > On 2018/08/16 16:05:31, Taylor Edmiston <te...@gmail.com> wrote:
> > > Does a scheduler restart make a difference?
> > >
> > > *Taylor Edmiston*
> > > Blog <https://blog.tedmiston.com/> | CV
> > > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > > <https://angel.co/taylor> | Stack Overflow
> > > <https://stackoverflow.com/users/149428/taylor-edmiston>
> > >
> > >
> > > On Thu, Aug 16, 2018 at 4:25 AM, ramandumcs@gmail.com <
> > ramandumcs@gmail.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > We are using airflow 1.9 with Local Executor more. Intermittently we
> > are
> > > > observing that tasks are getting stuck in "up_for_retry" mode and are
> > > > getting retried again and again exceeding their configured max retries
> > > > count. like we have configured max retries as 2 but task is retried 15
> > > > times and got stuck in up_for_retry state.
> > > > Any pointer on this would be helpful.
> > > >
> > > > Thanks,
> > > > Raman Gupta
> > > >
> > >
> >
> 
> 
> 
> -- 
> 
> *< Dr. Matthias Huschle />*
> 
> Data and Analytics Manager
> 
> 
> 
> 
> 
> E: matthias.huschle@paymill.de
> 
> Connect with me LinkedIn
> <https://de.linkedin.com/pub/matthias-huschle/b0/912/91>| Xing
> <https://www.xing.com/profile/Matthias_Huschle>
> 
> 

Re: Task is Stuck in Up_For_Retry

Posted by Matthias Huschle <ma...@paymill.de>.
Hi Raman,

Does it happen only occasionally, or can it be easily reproduced?
What happens if you start it with "airflow run" or "airflow test"?
What is in the logs about it?
What is your user process limit ("ulimit -u") on that machine?


2018-08-17 15:39 GMT+02:00 ramandumcs@gmail.com <ra...@gmail.com>:

> Thanks Taylor,
> We are getting this issue even after restart. We are observing that task
> instance state is transitioned from
> scheduled->queued->up_for_retry and dag gets stuck in up_for_retry state.
> Behind the scenes executor keep on retrying the dag's task exceeding the
> max retry limit.
> In normal scenario task state should have following transition
> scheduled->queued->running->up_for_retry
> but we are seeing task is not entered in to running state rather it moves
> directly to up_for_retry from queued state.
> Any pointer on this would be helpful.
>
> Thanks,
> raman Gupta
>
>
> On 2018/08/16 16:05:31, Taylor Edmiston <te...@gmail.com> wrote:
> > Does a scheduler restart make a difference?
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stack Overflow
> > <https://stackoverflow.com/users/149428/taylor-edmiston>
> >
> >
> > On Thu, Aug 16, 2018 at 4:25 AM, ramandumcs@gmail.com <
> ramandumcs@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > We are using airflow 1.9 with Local Executor more. Intermittently we
> are
> > > observing that tasks are getting stuck in "up_for_retry" mode and are
> > > getting retried again and again exceeding their configured max retries
> > > count. like we have configured max retries as 2 but task is retried 15
> > > times and got stuck in up_for_retry state.
> > > Any pointer on this would be helpful.
> > >
> > > Thanks,
> > > Raman Gupta
> > >
> >
>



-- 

*< Dr. Matthias Huschle />*

Data and Analytics Manager





E: matthias.huschle@paymill.de

Connect with me LinkedIn
<https://de.linkedin.com/pub/matthias-huschle/b0/912/91>| Xing
<https://www.xing.com/profile/Matthias_Huschle>



Re: Task is Stuck in Up_For_Retry

Posted by ra...@gmail.com, ra...@gmail.com.
Thanks Taylor,
We are getting this issue even after a restart. We are observing that the task instance state transitions from
scheduled->queued->up_for_retry and the DAG gets stuck in the up_for_retry state. Behind the scenes the executor keeps retrying the dag's task, exceeding the max retry limit.
In a normal scenario the task state should follow the transition
scheduled->queued->running->up_for_retry,
but we are seeing that the task never enters the running state; rather, it moves directly to up_for_retry from the queued state.
Any pointer on this would be helpful.

Thanks,
Raman Gupta


On 2018/08/16 16:05:31, Taylor Edmiston <te...@gmail.com> wrote: 
> Does a scheduler restart make a difference?
> 
> *Taylor Edmiston*
> Blog <https://blog.tedmiston.com/> | CV
> <https://stackoverflow.com/cv/taylor> | LinkedIn
> <https://www.linkedin.com/in/tedmiston/> | AngelList
> <https://angel.co/taylor> | Stack Overflow
> <https://stackoverflow.com/users/149428/taylor-edmiston>
> 
> 
> On Thu, Aug 16, 2018 at 4:25 AM, ramandumcs@gmail.com <ra...@gmail.com>
> wrote:
> 
> > Hi All,
> >
> > We are using airflow 1.9 with Local Executor more. Intermittently we are
> > observing that tasks are getting stuck in "up_for_retry" mode and are
> > getting retried again and again exceeding their configured max retries
> > count. like we have configured max retries as 2 but task is retried 15
> > times and got stuck in up_for_retry state.
> > Any pointer on this would be helpful.
> >
> > Thanks,
> > Raman Gupta
> >
> 

Re: Task is Stuck in Up_For_Retry

Posted by Taylor Edmiston <te...@gmail.com>.
Does a scheduler restart make a difference?

*Taylor Edmiston*
Blog <https://blog.tedmiston.com/> | CV
<https://stackoverflow.com/cv/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList
<https://angel.co/taylor> | Stack Overflow
<https://stackoverflow.com/users/149428/taylor-edmiston>


On Thu, Aug 16, 2018 at 4:25 AM, ramandumcs@gmail.com <ra...@gmail.com>
wrote:

> Hi All,
>
> We are using airflow 1.9 with Local Executor more. Intermittently we are
> observing that tasks are getting stuck in "up_for_retry" mode and are
> getting retried again and again exceeding their configured max retries
> count. like we have configured max retries as 2 but task is retried 15
> times and got stuck in up_for_retry state.
> Any pointer on this would be helpful.
>
> Thanks,
> Raman Gupta
>