Posted to dev@airflow.apache.org by harish singh <ha...@gmail.com> on 2016/05/13 20:19:48 UTC
depends_on_past not working as expected?
Hi guys,
I am having an issue getting 'depends_on_past=True' to work.
This is my pipeline:
a -> b -> c -> d -> e
a -> x -> e
a -> y -> e
I have default args for all Tasks:
from datetime import datetime, timedelta

# Start of the previous full hour (UTC), recomputed each time the file is parsed
scheduling_start_date = (datetime.utcnow() -
    timedelta(hours=1)).replace(minute=0, second=0,
    microsecond=0)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': scheduling_start_date,
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    # default_retries_delay is defined elsewhere in the DAG file (not shown here)
    'retry_delay': default_retries_delay,
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
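A side note on the start_date expression above: it is evaluated whenever the DAG file is parsed, so its value moves forward with the clock. The sketch below uses only the standard library to show this; the function name hourly_start_date and the two parse times are made up for illustration. Whether this plays any part in the behaviour discussed in this thread is not established here, though the Airflow documentation generally advises a fixed start_date.

```python
from datetime import datetime, timedelta


def hourly_start_date(now):
    """Step back one hour from `now`, then floor to the hour (the post's expression)."""
    return (now - timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)


# Two hypothetical moments at which the DAG file might be parsed.
first_parse = datetime(2016, 5, 13, 20, 19, 48)
later_parse = datetime(2016, 5, 13, 22, 5, 0)

print(hourly_start_date(first_parse))  # 2016-05-13 19:00:00
print(hourly_start_date(later_parse))  # 2016-05-13 21:00:00
```

The same expression yields a different start_date at each parse, which is why a constant such as datetime(2016, 5, 13) is usually easier to reason about.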
But specifically for tasks d, x and y, I override this on the task itself:
depends_on_past=True
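To make the override explicit: an argument passed directly to a task takes precedence over the same key in default_args. The following is a plain-Python stand-in for that merge, not Airflow's own code; effective_args is a hypothetical helper and the dictionary is a trimmed copy of the one above.

```python
# Trimmed stand-in for the post's default_args; only a few keys are shown.
default_args = {'owner': 'airflow', 'depends_on_past': False, 'retries': 2}


def effective_args(defaults, **task_kwargs):
    """Merge defaults with task-level arguments; task-level values win."""
    merged = dict(defaults)  # copy so the shared defaults are not mutated
    merged.update(task_kwargs)
    return merged


for task_id in ['a', 'b', 'c', 'd', 'e', 'x', 'y']:
    overrides = {'depends_on_past': True} if task_id in ('d', 'x', 'y') else {}
    print(task_id, effective_args(default_args, **overrides)['depends_on_past'])
```

Only d, x and y end up with depends_on_past set to True; the other tasks keep the default of False.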
So now:
In the first hour, d, x and y failed.
So I am assuming these tasks should not even be attempted in the next hour,
right?
But in the next hour and subsequent hours, I see these tasks getting
triggered (and failing) ...
Shouldn't the behavior be that if a task's previous execution failed, no
attempt is made during the next run of the DAG?
Or am I doing something very "bad" here?
Thanks,
Harish
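The expectation described in the question can be written down as a toy model. This is a sketch of the documented intent of depends_on_past (a task instance waits for the previous instance of the same task to succeed, and the first instance is allowed to run); it is not the scheduler code of Airflow 1.7.0. The names should_run and simulate, and the 'not_run' label, are invented for illustration.

```python
def should_run(depends_on_past, previous_state):
    """Toy rule: with depends_on_past, run only if there is no previous
    instance (first run) or the previous instance succeeded."""
    if not depends_on_past:
        return True
    return previous_state in (None, 'success')


def simulate(outcomes, depends_on_past):
    """Walk schedule intervals in order. `outcomes` holds what each run would
    produce if attempted. Returns the state recorded for each interval."""
    states = []
    previous = None
    for outcome in outcomes:
        if should_run(depends_on_past, previous):
            state = outcome
        else:
            state = 'not_run'  # made-up label: blocked by the earlier failure
        states.append(state)
        previous = state
    return states


# Task d fails in the first hour and would fail again in later hours.
print(simulate(['failed', 'failed', 'failed'], depends_on_past=True))
# ['failed', 'not_run', 'not_run']
print(simulate(['failed', 'failed', 'failed'], depends_on_past=False))
# ['failed', 'failed', 'failed']
```

Under this model the second and third hours are never attempted for d, x and y, which is the behaviour the question expects but does not observe.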
Re: depends_on_past not working as expected?
Posted by Bolke de Bruin <bd...@gmail.com>.
> On 13 May 2016, at 23:06, harish singh <ha...@gmail.com> wrote:
>
> We are seeing this in production. I won't be able to update the version
> right now, but I will try to test this out over the weekend.
> But if I consider 1.7.0, am I doing something incorrect? Or did something
> change in 1.7.1.rc6?
No, from an initial analysis I wouldn’t say you are doing anything wrong. However, we made a lot of stability fixes in 1.7.1. For your scenario in particular, you might want to try out AIRFLOW-20 (see Jira). If you can supply an example DAG, it will be easier to help out.
>
> One thing I forgot to mention is that we run a backfill before we
> turn on the DAG.
> So if I have to turn the DAG on right now, I will first run a backfill for
> the last 24 hours and then turn it on (from the UI) so that it gets scheduled
> by the scheduler.
>
> Nevertheless, I am going to try this scenario on 1.7.1.rc6.
>
> Thanks!
>
>
Re: depends_on_past not working as expected?
Posted by harish singh <ha...@gmail.com>.
We are seeing this in production. I won't be able to update the version
right now, but I will try to test this out over the weekend.
But if I consider 1.7.0, am I doing something incorrect? Or did something
change in 1.7.1.rc6?
One thing I forgot to mention is that we run a backfill before we
turn on the DAG.
So if I have to turn the DAG on right now, I will first run a backfill for
the last 24 hours and then turn it on (from the UI) so that it gets scheduled
by the scheduler.
Nevertheless, I am going to try this scenario on 1.7.1.rc6.
Thanks!
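For reference, the "backfill the last 24 hours, then turn the DAG on" step could look roughly like this. The sketch only computes the window and prints a command line; the DAG id my_dag is hypothetical (the thread does not name the DAG), and the exact flags and accepted date formats of the airflow backfill command should be checked against the installed version.

```python
from datetime import datetime, timedelta


def backfill_window(now, hours=24):
    """Return (start, end) covering the last `hours` full hours before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=hours), end


# A fixed example time; in practice this would be datetime.utcnow().
start, end = backfill_window(datetime(2016, 5, 13, 21, 6))

# Hypothetical DAG id; -s and -e are the start and end dates of the backfill.
print('airflow backfill -s {} -e {} my_dag'.format(start.isoformat(), end.isoformat()))
```

Task instances created by such a backfill become the "previous" instances that depends_on_past looks at once the scheduler takes over, so their final states matter for the first scheduled run.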
On Fri, May 13, 2016 at 1:54 PM, Bolke de Bruin <bd...@gmail.com> wrote:
>
> Can you try 1.7.1.rc6 before we dive in?
>
>
Re: depends_on_past not working as expected?
Posted by Bolke de Bruin <bd...@gmail.com>.
> On 13 May 2016, at 22:51, harish singh <ha...@gmail.com> wrote:
>
> Bolke, it's 1.7.0
Can you try 1.7.1.rc6 before we dive in?
Re: depends_on_past not working as expected?
Posted by harish singh <ha...@gmail.com>.
Bolke, it's 1.7.0
On Fri, May 13, 2016 at 1:35 PM, Bolke de Bruin <bd...@gmail.com> wrote:
>
> What version are you on, Harish?
>
>
Re: depends_on_past not working as expected?
Posted by Bolke de Bruin <bd...@gmail.com>.
What version are you on, Harish?