Posted to dev@airflow.apache.org by raman gupta <ra...@gmail.com> on 2018/10/03 08:09:13 UTC

Re: Task is Stuck in Up_For_Retry

Hi All,

On further investigation we found that this issue is reproduced if the task's
retry_delay is set to a very low value, say 10 seconds.
All retries of a task get the same key from the executor (LocalExecutor),
which is composed of dag_id, task_id and execution date. So we are hitting
a scenario where the scheduler schedules the next task run while the
executor's event for the previous run has not yet been processed. In jobs.py's
process_executor_events function, the scheduler then associates the last run's
executor event with the current run (which might still be in the queued state)
and marks it as failed/up_for_retry.
Due to this, the task's state follows the transition
up_for_retry -> scheduled -> queued -> failed/up_for_retry instead of
up_for_retry -> scheduled -> queued -> running -> failed/up_for_retry.
This is when we see logs like "Task is not able to be run" and
"executor reports the task as finished but it is still in queued state".
We have raised a JIRA for this,
https://issues.apache.org/jira/browse/AIRFLOW-3136, and are also looking
into a potential fix.
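
To make the race easier to follow, here is a rough sketch in plain Python
(not actual Airflow code; names and values are illustrative) of how a stale
executor event can get matched against the next, still-queued try:

    from datetime import datetime

    # The LocalExecutor keys its event buffer by (dag_id, task_id,
    # execution_date), so every retry of one task instance shares a key.
    def executor_key(dag_id, task_id, execution_date):
        return (dag_id, task_id, execution_date)

    event_buffer = {}  # key -> state reported by the executor
    key = executor_key("my_dag", "my_task", datetime(2018, 10, 3))

    # Try 1 fails and the executor records the event...
    event_buffer[key] = "failed"

    # ...but with a tiny retry_delay the scheduler has already queued try 2
    # before the event buffer is drained.
    current_state_of_try_2 = "queued"

    # process_executor_events-style handling then matches the stale event
    # against the queued try 2 and fails it without ever running it.
    for _, reported_state in event_buffer.items():
        if reported_state == "failed" and current_state_of_try_2 == "queued":
            print("executor reports the task as finished "
                  "but it is still in queued state")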

Thanks,
Raman Gupta

On Fri, Aug 24, 2018 at 3:56 PM ramandumcs@gmail.com <ra...@gmail.com>
wrote:

> Hi All,
> Any pointer on this would be helpful. We have added extra logs and are
> trying a few things to get to the root cause. But we are getting logs like
> "Task is not able to be run".
> And we are not seeing any resource-usage-related errors.
>
> Thanks,
> Raman Gupta
>
> On 2018/08/21 16:46:56, ramandumcs@gmail.com <ra...@gmail.com>
> wrote:
> > Hi All,
> > As per http://docs.sqlalchemy.org/en/latest/core/connections.html, a DB
> > engine is not portable across process boundaries:
> > "For a multiple-process application that uses the os.fork system call,
> > or for example the Python multiprocessing module, it’s usually required
> > that a separate Engine be used for each child process. This is because the
> > Engine maintains a reference to a connection pool that ultimately
> > references DBAPI connections - these tend to not be portable across process
> > boundaries"
> > Please correct me if I am wrong, but it seems that in Airflow 1.9 child
> > processes don't create a separate DB engine, so a single DB Engine is
> > shared among the child processes, which might be causing this issue.
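> >
> > A minimal sketch of the per-process pattern the SQLAlchemy docs describe
> > (the connection string and names here are illustrative, not Airflow's
> > actual setup): dispose of the inherited pool so the child process opens
> > its own connections.
> >
> >     import multiprocessing
> >
> >     from sqlalchemy import create_engine, text
> >
> >     engine = create_engine("postgresql://airflow@localhost/airflow")
> >
> >     def child_work():
> >         # Drop the pooled DBAPI connections inherited from the parent;
> >         # the Engine will lazily open fresh ones owned by this process.
> >         engine.dispose()
> >         with engine.connect() as conn:
> >             conn.execute(text("SELECT 1"))
> >
> >     if __name__ == "__main__":
> >         p = multiprocessing.Process(target=child_work)
> >         p.start()
> >         p.join()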
> >
> > Thanks,
> > Raman Gupta
> >
> >
> > On 2018/08/21 15:41:14, raman gupta <ra...@gmail.com> wrote:
> > > One possibility is the unavailability of a session while calling the
> > > self.task_instance._check_and_change_state_before_execution
> > > function.
> > > (The session is provided via the @provide_session decorator.)
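> > >
> > > For context, a simplified sketch of what a @provide_session-style
> > > decorator does (written from memory, not the exact Airflow
> > > implementation; the Session factory below is just a stand-in for
> > > airflow.settings.Session): if the caller does not pass a session, one
> > > is created, committed and closed around the call.
> > >
> > >     import functools
> > >
> > >     from sqlalchemy import create_engine
> > >     from sqlalchemy.orm import sessionmaker
> > >
> > >     Session = sessionmaker(bind=create_engine("sqlite://"))
> > >
> > >     def provide_session(func):
> > >         @functools.wraps(func)
> > >         def wrapper(*args, **kwargs):
> > >             if kwargs.get("session") is None:
> > >                 session = Session()
> > >                 kwargs["session"] = session
> > >                 try:
> > >                     result = func(*args, **kwargs)
> > >                     session.commit()
> > >                     return result
> > >                 finally:
> > >                     session.close()
> > >             return func(*args, **kwargs)
> > >         return wrapper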
> > >
> > > On Tue, Aug 21, 2018 at 7:09 PM vardanguptacse@gmail.com <
> > > vardanguptacse@gmail.com> wrote:
> > >
> > > > Is there any possibility that when the function
> > > > _check_and_change_state_before_execution is called at
> > > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2500,
> > > > this method is not actually being invoked:
> > > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/models.py#L1299?
> > > > Because even in the happy scenario, no logs are printed from the method's
> > > > implementation and control reaches here directly:
> > > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2512
> > > > while in the stuck phase we are seeing this log:
> > > > https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/jobs.py#L2508
> > > > i.e. "Task is not able to be run". FYI, we've not set any sort of
> > > > dependency with the DAG.
> > > >
> > > > Regards,
> > > > Vardan Gupta
> > > >
> > > > On 2018/08/16 08:25:37, ramandumcs@gmail.com <ra...@gmail.com>
> > > > wrote:
> > > > > Hi All,
> > > > >
> > > > > We are using Airflow 1.9 in LocalExecutor mode. Intermittently we
> > > > > are observing that tasks get stuck in the "up_for_retry" state and
> > > > > are retried again and again, exceeding their configured max retries
> > > > > count. For example, we have configured max retries as 2, but a task
> > > > > was retried 15 times and got stuck in the up_for_retry state.
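> > > > >
> > > > > For reference, a minimal sketch of the kind of task configuration
> > > > > we are describing (illustrative DAG and task names, not our actual
> > > > > pipeline), in Airflow 1.9 style:
> > > > >
> > > > >     from datetime import datetime, timedelta
> > > > >
> > > > >     from airflow import DAG
> > > > >     from airflow.operators.bash_operator import BashOperator
> > > > >
> > > > >     default_args = {
> > > > >         "owner": "airflow",
> > > > >         "start_date": datetime(2018, 8, 1),
> > > > >         "retries": 2,  # we expect at most 2 retries
> > > > >         "retry_delay": timedelta(minutes=5),
> > > > >     }
> > > > >
> > > > >     dag = DAG("example_retry_dag",
> > > > >               default_args=default_args,
> > > > >               schedule_interval="@daily")
> > > > >
> > > > >     # A task that fails so that the retry handling kicks in.
> > > > >     flaky = BashOperator(task_id="flaky_task",
> > > > >                          bash_command="exit 1",
> > > > >                          dag=dag)
> > > > >
> > > > > With settings like these, the task instance still ends up being
> > > > > retried far more often than retries=2 would suggest.
> > > > >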
> > > > > Any pointer on this would be helpful.
> > > > >
> > > > > Thanks,
> > > > > Raman Gupta
> > > > >
> > > >
> > >
> >
>