Posted to dev@airflow.apache.org by Pramiti Goel <pr...@gmail.com> on 2018/10/22 07:32:07 UTC

Is Using Too Many Airflow Variables in a DAG a Good Thing?

Hi,

We want to keep the owner and email ID generic, so we don't want to hardcode
them in each Airflow DAG. Using Variables would let us change the email/owner
later if there are a lot of DAGs with the same owner.

For example:


from datetime import datetime, timedelta

from airflow.models import Variable

default_args = {
    'owner': Variable.get('test_owner_de'),
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 17),
    'email': Variable.get('de_infra_email'),
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=1),
}


Looking into the Airflow code, it appears a connection/session is created and
then closed every time a Variable is read (let me know if I've misunderstood).
If many DAGs with Variables in their default_args are parsed in parallel, each
querying the variable table in MySQL, is there any limit on the number of
SQLAlchemy sessions? Will the many MySQL queries per DAG make parsing slow? Is
the above approach a good one?
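
For context, this is roughly what Variable.get() boils down to in Airflow 1.9
as far as I can tell (a simplified paraphrase of the @provide_session
behaviour, not the exact source):

from airflow import settings
from airflow.models import Variable

def get_variable_once(key, default_var=None):
    # A new session (and potentially a new DB connection) on every call...
    session = settings.Session()
    try:
        obj = session.query(Variable).filter(Variable.key == key).first()
        return default_var if obj is None else obj.val
    finally:
        # ...torn down again immediately afterwards.
        session.close()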

Using Airflow 1.9.

Thanks,
Pramiti.

Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Shah Altaf <me...@gmail.com>.
In my opinion this isn't a good thing: as you've observed, it makes a
database connection each time, and since the scheduler re-parses DAG files
every few seconds, the database gets queried frequently. As the number of
DAGs grows, you will see the number of connections grow as well. We had a
similar situation with our Variable.get() calls: some were in default_args
and some were top-level variables, so they kept being re-executed every time
the .py file was scanned. With just 20-30 DAGs we ran over the maximum
connection limit in RDS (t2.small), so we moved all Variable.get() calls as
late as possible (inside methods/jinja templates), as sketched below.
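
To illustrate what "as late as possible" means (a sketch; the task names are
made up and a dag object is assumed to exist already): the jinja form is only
rendered when the task runs, and a Variable.get() inside a python callable
only fires at execution time, so neither touches the database during parsing.

from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Rendered at run time, not when the scheduler parses the .py file:
notify = BashOperator(
    task_id='notify',
    bash_command='echo "owner: {{ var.value.test_owner_de }}"',
    dag=dag)

def _send_report(**kwargs):
    # Runs inside the worker at execution time, so parsing stays DB-free.
    email = Variable.get('de_infra_email')
    print('would email %s' % email)

report = PythonOperator(
    task_id='send_report',
    python_callable=_send_report,
    dag=dag)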

In the example you've shown, it's worth converting to environment variables
if you need some flexibility; the change management then shifts to your
deployment/configuration mechanism. Or just hardcode the values and redeploy
when needed.
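
A minimal sketch of the environment-variable version (the variable names and
fallback defaults here are made up):

import os
from datetime import datetime, timedelta

# os.environ.get is an in-process dict lookup: no DB session and no
# connection churn at parse time.
default_args = {
    'owner': os.environ.get('DE_TEAM_OWNER', 'data-engineering'),
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 17),
    'email': os.environ.get('DE_INFRA_EMAIL', 'de-infra@example.com'),
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=1),
}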






Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Sai Phanindhra <ph...@gmail.com>.
We need to use something outside the Airflow ecosystem. For caching we could
still keep values in memory or on the filesystem, but since Airflow is
distributed across multiple machines, that approach won't be very efficient.
A caching solution outside the Airflow ecosystem would work, as long as it is
centralised in a single place and accessible to all Airflow components.
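
As a sketch of what that could look like (a hypothetical helper; it assumes a
Redis instance reachable by every scheduler and worker, and the redis-py
client installed):

import redis

from airflow.models import Variable

r = redis.Redis(host='redis.internal', port=6379)  # shared by all components
CACHE_TTL = 300  # seconds; a few multiples of the scheduler loop

def cached_variable(key, default=None):
    val = r.get(key)
    if val is not None:
        return val.decode('utf-8')    # cache hit: no metadata-DB round trip
    val = Variable.get(key, default_var=default)
    if val is not None:
        r.setex(key, CACHE_TTL, val)  # cache miss: store with an expiry
    return val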


Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Sumit Maheshwari <su...@gmail.com>.
>
> > On top of that, we can expire the cache after a few scheduler runs (5 or
> > 10 times one scheduler run interval).
>

If you just want caching for a fixed amount of time, rather than real-time
invalidation (i.e. where changing a variable from the UI invalidates the
cache immediately), then one easy way is to run a small bash task, say every
15 minutes, which loads all these variables into OS environment variables,
and then use those in your main DAGs. That way, if you need to reload your
cache (the env vars) immediately after a change, you can do a manual run of
that bash job as well.
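
One possible shape for that refresher task (a sketch, not a recipe: the file
path and key list are made up, and how the file gets sourced into your
workers' environment depends entirely on your deployment):

from airflow.models import Variable

REFRESH_KEYS = ['test_owner_de', 'de_infra_email']

def dump_variables_to_env_file(path='/etc/airflow/variables.env'):
    # Call this from a task on a */15 schedule, or trigger its DAG manually
    # right after changing a Variable in the UI.
    with open(path, 'w') as f:
        for key in REFRESH_KEYS:
            f.write('%s=%s\n' % (key.upper(), Variable.get(key)))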



Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Ash Berlin-Taylor <as...@apache.org>.
Redis is not a requirement of Airflow currently, nor should it become a hard requirement either.

Benchmarks are definitely needed before we bring in anything as complex as a
cache.

Queries to the variables table _should_ be fast too: even if it has 1,000
rows, that is tiny by RDBMS standards. If the problem is connection set-up
and tear-down time, then we should find that out.
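
A quick micro-benchmark along those lines (a sketch; run it somewhere with
access to your metadata DB):

import time

from airflow.models import Variable

# Each Variable.get() opens a session, queries, and closes it again, so a
# per-lookup time that dwarfs the query itself points at connection
# handling rather than table size.
N = 100
start = time.time()
for _ in range(N):
    Variable.get('test_owner_de')
elapsed = time.time() - start
print('%d lookups in %.3fs (%.1f ms each)' % (N, elapsed, 1000.0 * elapsed / N))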


Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Sai Phanindhra <ph...@gmail.com>.
On top of that, we can expire the cache after a few scheduler runs (5 or 10
times one scheduler run interval).


Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Sai Phanindhra <ph...@gmail.com>.
That's true, but variables won't change very frequently. We could cache these
variables somewhere outside the Airflow ecosystem, something like Redis or
memcached, since queries to those stores are fast. That would reduce latency
and decrease the number of connections to the main database. This whole
assumption needs to be benchmarked to prove the point, but I feel it's worth
a try.


Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Ash Berlin-Taylor <as...@apache.org>.
Cache them where? When would it get invalidated? Given that DAG parsing
happens in a sub-process, how would the cache live longer than that process?

I think the change might be to use a per-process/per-thread SQLAlchemy
connection when parsing DAGs, so that if a DAG needs access to the metadata
DB it does so with just one connection rather than N.

-ash
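
For what it's worth, SQLAlchemy's scoped_session is the usual building block
for that kind of reuse (an illustrative sketch of the idea, not a patch
against Airflow's actual session handling; the connection URI is a
placeholder and needs a MySQL driver installed):

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine('mysql://airflow:PASSWORD@mysql-host/airflow')

# scoped_session hands each thread its own session and returns the same one
# on later calls, so N Variable lookups during a single parse would share
# one connection instead of opening N of them.
Session = scoped_session(sessionmaker(bind=engine))

session_a = Session()
session_b = Session()
assert session_a is session_b  # same session within this thread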



Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Sai Phanindhra <ph...@gmail.com>.
Why don't we cache variables? We can fairly assume that variables won't
change very frequently (not as frequently as the scheduler's DAG parse loop).
We could keep the default timeout at a few multiples of the scheduler run
time. This would help control the number of connections to the database and
reduce load on both the scheduler and the database.
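
The simplest version of that idea is an in-process memo with a timeout (a
sketch; note Ash's caveat elsewhere in this thread that a cache living inside
the parsing sub-process dies with that process, so this only saves queries
within a single parse):

import time

from airflow.models import Variable

_cache = {}            # key -> (value, fetched_at); lives in this process only
DEFAULT_TIMEOUT = 300  # seconds; a few multiples of the scheduler run time

def get_cached(key, default=None, timeout=DEFAULT_TIMEOUT):
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[1] < timeout:
        return hit[0]  # still fresh: no DB round trip
    val = Variable.get(key, default_var=default)
    _cache[key] = (val, time.time())
    return val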


Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?

Posted by Marcin Szymański <ms...@gmail.com>.
Hi

You are right, it's a sure way to saturate DB connections, as a connection is
established every few seconds when the DAGs are parsed. The same happens when
you use Variables in the __init__ of an operator. An OS environment variable
would be safer for your need.

Marcin

