Posted to dev@airflow.apache.org by Adam Gutcheon <Ad...@monitor-360.com> on 2016/10/28 15:11:56 UTC

Task not failing when execution_timeout reached

Hello,

I'm having a big, show-stopping problem with my Airflow installation. When
a task reaches its execution_timeout, I can see the error message in the
task's log, but the task never actually fails; it stays in the running
state forever. This is true of any task that has an execution_timeout set,
in any DAG. I am using the CeleryExecutor. Are there hidden pitfalls to
timeouts I should know about?
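
For reference, the affected tasks are defined roughly like this (DAG id,
task id, command, and timeout value below are made up for illustration):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative only -- DAG id, task id, command, and timeout are made up.
dag = DAG('example_timeout_dag',
          start_date=datetime(2016, 10, 1),
          schedule_interval='@daily')

# When the 30-minute limit is hit, the timeout error shows up in the task
# log, but the task instance stays in the running state instead of failing.
slow_task = BashOperator(
    task_id='example_slow_task',
    bash_command='sleep 7200',
    execution_timeout=timedelta(minutes=30),
    dag=dag)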

Thanks,
Adam G.


Re: Task not failing when execution_timeout reached

Posted by siddharth anand <sa...@apache.org>.
For completeness, there is also support for a DAG run timeout, which is yet
another mechanism. I haven't used it myself, but I believe it was
introduced in 1.7.x.
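
If memory serves, it's the dagrun_timeout parameter on the DAG constructor;
a minimal sketch (DAG id and values below are made up):

from datetime import datetime, timedelta
from airflow import DAG

# Illustrative only -- DAG id and values are made up. If a run of this DAG
# is still going after 4 hours, the scheduler can time the DAG run out.
dag = DAG('example_dagrun_timeout_dag',
          start_date=datetime(2016, 10, 1),
          schedule_interval='@hourly',
          dagrun_timeout=timedelta(hours=4))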

-s


Re: Task not failing when execution_timeout reached

Posted by siddharth anand <sa...@apache.org>.
1.6.x had an infinite retry problem. If you specified a retry count greater
than 1, the tasks would get retried ad infinitum.

This was fixed in 1.7.x (1.7.1.3 is the most recent release).

We have been using *execution_timeout* for over a year.

build_sender_models_spark_job = BashOperator(
    task_id='build_sender_models_spark_job',
    execution_timeout=timedelta(hours=3),
    pool='ep_data_pipeline_spark_tasks_only',
    bash_command=sender_model_building_command,
    params={'CLUSTER_IP': PLATFORM_VARS['ip'],
            'USER': PLATFORM_VARS['ssh_user'],
            'HOME_DIR': PLATFORM_VARS['home_dir'],
            'SSH_KEY': SSH_KEY},
    dag=dag)


As an additional measure, we specify an SLA on the last task of our DAG.
We have an hourly DAG, so if the last task for an hourly DAG run exceeds
2 hours, we have missed our SLA. For example, for an execution date of
*20161027T12:00:00Z*, we'd expect the run to start at *20161027T13:00:00Z*.
By *20161027T15:00:00Z*, we would be notified of the SLA miss.

# Operator : Send Email when flow completes successfully
send_email_notification_flow_successful = PythonOperator(
    task_id='send_email_notification_flow_successful',
    execution_timeout=timedelta(minutes=15),
    pool='ep_data_pipeline_metrics_gathering',
    provide_context=True,
    sla=timedelta(hours=2),
    python_callable=send_email_notification_flow_successful,
    dag=dag)


SLAs have been around since 1.6.x or earlier. In 1.7.x, I added a callback
mechanism to alert on an SLA miss. At Agari, we essentially page our
on-call engineer and write info to Slack.

default_args = {
    'owner': 'sanand',
    'depends_on_past': True,
    'pool': 'ep_data_pipeline',
    'start_date': START_DATE,
    'email': [import_ep_pipeline_alert_email_dl],
    'email_on_failure': import_airflow_enable_notifications,
    'email_on_retry': import_airflow_enable_notifications,
    'retries': 10,
    'retry_delay': timedelta(seconds=30),
    'priority_weight': import_airflow_priority_weight}

dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args,
          sla_miss_callback=sla_alert_func)
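
For reference, a rough sketch of what such a callback can look like (not
our actual code; the exact arguments Airflow passes to sla_miss_callback
have varied between versions, so treat the signature as an assumption):

import logging

# Sketch only: in practice the body would post to a Slack webhook and page
# the on-call engineer; here it just logs the miss.
def sla_alert_func(dag, task_list, blocking_task_list, slas, blocking_tis):
    logging.error('SLA miss on DAG %s for tasks:\n%s', dag.dag_id, task_list)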


You can use SLAs as an alternative approach to achieve your goals, or in
tandem with retries, as we do.
-s

