Posted to dev@airflow.apache.org by harish singh <ha...@gmail.com> on 2016/05/13 20:19:48 UTC

depends_on_past not working as expected?

Hi guys,

I am having an issue getting 'depends_on_past=True' to work.

This is my pipeline:

a -> b -> c -> d -> e

a -> x -> e

a -> y -> e

I have default args for all tasks:

from datetime import datetime, timedelta

# Start of the previous full hour (UTC).
scheduling_start_date = (datetime.utcnow() -
    timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': scheduling_start_date,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': default_retries_delay,  # defined elsewhere (not shown)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}


But specifically for tasks d, x and y, I have set depends_on_past to True:

    depends_on_past=True
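
A minimal sketch of the per-task override described above, assuming the usual rule that arguments passed explicitly to a task take precedence over the shared default_args. The helper resolve_task_args is hypothetical, plain Python for illustration only, and not part of Airflow.

```python
# Hypothetical helper (not an Airflow API): models explicit task arguments
# taking precedence over the shared default_args.
def resolve_task_args(default_args, **task_kwargs):
    merged = dict(default_args)   # copy so the shared dict is not mutated
    merged.update(task_kwargs)    # explicit per-task values win
    return merged


default_args = {'owner': 'airflow', 'depends_on_past': False, 'retries': 2}

# d, x and y override depends_on_past; the other tasks keep the default.
overridden = ('d', 'x', 'y')
task_args = {
    name: resolve_task_args(default_args, depends_on_past=True)
    if name in overridden else resolve_task_args(default_args)
    for name in ('a', 'b', 'c', 'd', 'e', 'x', 'y')
}

for name in sorted(task_args):
    print(name, task_args[name]['depends_on_past'])
```

Under this model only d, x and y end up with depends_on_past set to True, while a, b, c and e keep the default of False.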


So now:
For the first hour, d, x and y failed.
So I am assuming that in the next hour these tasks should not even be
attempted, right?
But I see that in the next hour and in subsequent hours, these tasks are
still being triggered (and failing).
Should the behavior be that if a task's previous execution failed, no
attempt is made during the next run of the DAG?
Or am I doing something very "bad" here?
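
The expectation described above can be written down as a small simulation. This is a simplified model of the intended depends_on_past semantics, not Airflow's scheduler code: the function simulate_runs and the state names are hypothetical, and retries are ignored.

```python
# Simplified model (an assumption, not Airflow internals): with
# depends_on_past=True, a task instance is attempted only if the instance
# for the previous schedule period succeeded. The first period has no
# predecessor, so it is always attempted.
def simulate_runs(would_succeed, depends_on_past):
    """would_succeed: one bool per schedule period, saying whether the task
    body would succeed if it were attempted in that period.
    Returns one state per period: 'success', 'failed' or 'not_run'."""
    states = []
    for i, ok in enumerate(would_succeed):
        blocked = depends_on_past and i > 0 and states[i - 1] != 'success'
        if blocked:
            states.append('not_run')   # previous period did not succeed
        else:
            states.append('success' if ok else 'failed')
    return states


# A task that fails in every hour, as in the scenario above.
always_failing = [False, False, False]
print(simulate_runs(always_failing, depends_on_past=True))
print(simulate_runs(always_failing, depends_on_past=False))
```

Under this model, a task that fails in the first hour is not attempted again in later hours when depends_on_past is True, which is the behavior expected in the question; with depends_on_past set to False it is attempted (and fails) every hour, which matches what was observed.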


Thanks,
Harish

Re: depends_on_past not working as expected?

Posted by Bolke de Bruin <bd...@gmail.com>.
> On 13 May 2016, at 23:06, harish singh <ha...@gmail.com> wrote:
> 
> We are seeing this in production. I won't be able to update the version
> right now, but I will try to test this out over the weekend.
> But if I consider 1.7.0, am I doing something incorrect? Or did something
> change in 1.7.1.rc6?

No, from an initial analysis I don't think you are doing anything wrong. However, we made a lot of stability fixes in 1.7.1. For your scenario in particular, you might want to try out AIRFLOW-20 (see Jira). If you can supply an example DAG, it will be easier to help out.


> 
> One thing I forgot to mention: we run a backfill before we
> turn on the DAG.
> So if I have to turn the DAG on right now, I first run a backfill for
> the last 24 hours and then turn it on (from the UI) so that it gets
> scheduled by the scheduler.
> 
> Nevertheless, I am going to try this scenario on 1.7.1.rc6.
> 
> Thanks!


Re: depends_on_past not working as expected?

Posted by harish singh <ha...@gmail.com>.
We are seeing this in production. I won't be able to update the version
right now, but I will try to test this out over the weekend.
But if I consider 1.7.0, am I doing something incorrect? Or did something
change in 1.7.1.rc6?

One thing I forgot to mention: we run a backfill before we
turn on the DAG.
So if I have to turn the DAG on right now, I first run a backfill for
the last 24 hours and then turn it on (from the UI) so that it gets
scheduled by the scheduler.

Nevertheless, I am going to try this scenario on 1.7.1.rc6.

Thanks!


On Fri, May 13, 2016 at 1:54 PM, Bolke de Bruin <bd...@gmail.com> wrote:

> Can you try 1.7.1.rc6 before we dive in?
>
>

Re: depends_on_past not working as expected?

Posted by Bolke de Bruin <bd...@gmail.com>.
> On 13 May 2016, at 22:51, harish singh <ha...@gmail.com> wrote:
> 
> Bolke, it's 1.7.0

Can you try 1.7.1.rc6 before we dive in?


Re: depends_on_past not working as expected?

Posted by harish singh <ha...@gmail.com>.
Bolke, it's 1.7.0


On Fri, May 13, 2016 at 1:35 PM, Bolke de Bruin <bd...@gmail.com> wrote:

> What version are you on, Harish?
>
>

Re: depends_on_past not working as expected?

Posted by Bolke de Bruin <bd...@gmail.com>.
What version are you on, Harish?