Posted to dev@airflow.apache.org by "Daniel (Daniel Lamblin) [BDP - Seoul]" <la...@coupang.com> on 2018/05/30 01:19:37 UTC

conn_id breaking change; once more with feeling

The short of this email is: can we please name the connection-id parameters of all hooks and operators as consistently as possible, e.g. just `conn_id`?

So, when we started using Airflow I _had_ thought that minor versions would be compatible with a user's DAGs, assuming no use of anything marked beta or deprecated, such that v1.7.1, 1.8.0, 1.8.1, 1.8.2, 1.9.0, etc. would all run DAGs from prior versions in that lineage, each with possible stability and feature improvements, and each with possibly more operators, hooks, even executors, etc.

This is (now) obviously not the case, and it's the group's choice what should and should not break on a release-by-release basis. I think a clear policy would be appropriate for full Apache status, though I may have missed where the policy is defined. Still, a policy that doesn't promise stability for users' DAGs across most version changes isn't, in my opinion, going to grow confidence in Airflow as something you can grow with.

Not to be overly dramatic, but the tiny change that `S3Hook(…)` now takes `aws_conn_id='any_string'` instead of `s3_conn_id='still_any_string'` means that I basically have to maintain a 1.8.2 setup in perpetuity, because no one (here) wants to briefly code freeze before, during, and after an update so that we can update all the usages and be ready to roll back the update if something else breaks (also, some people access the not-actually-private `_a_key` and `_s_key` and would need to switch to the still-not-private `_get_credentials()[…]`). Instead we'll just run a second Airflow setup at 1.9.0 (in each and every staged environment) and move the 8k DAGs piecemeal whe[never] people get the spare time and inclination. I mean, we might. It looks like 1.10.0 changes some config around S3 logging one more time… so maybe it's better to skip it?

I saw the couple of PRs where the project itself had to change usages of the named field. There was momentary, passing concern that users' DAGs would need to do the same. In the PRs, among the options discussed (shims, supporting the old name as deprecated, doing a hard rename), it wasn't brought up whether the rename to `aws_conn_id` was the best name. [Was this discussed elsewhere?]

And so I wondered: why is there this virtual Hungarian notation on all the `conn_id`s?
A hook generally takes one `conn_id`, and most operators take only one. In those cases, couldn't the named parameter have been `conn_id`? When an operator needs a couple of conn_ids, could it not take `src_conn_id` and `dest_conn_id` instead of locking in either s3 or aws, mysql or maria, dataflow or beam, etc.? Those are hypotheticals, I believe.
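To sketch what that could look like (hypothetical class and parameter names, illustrative only, not Airflow's actual API):

```python
# Hypothetical, consistently named signatures: a hook takes a single
# generic `conn_id`, and a transfer operator takes `src_conn_id` and
# `dest_conn_id`. Illustrative only, not Airflow's actual API.

class SomeHook:
    def __init__(self, conn_id):
        self.conn_id = conn_id  # one connection per hook

class SomeTransferOperator:
    def __init__(self, src_conn_id, dest_conn_id):
        self.src_conn_id = src_conn_id    # source connection, any backend
        self.dest_conn_id = dest_conn_id  # destination stays backend-agnostic

# Renaming a backing service (s3 -> aws, mysql -> maria) would no longer
# force a parameter rename in every DAG:
copy = SomeTransferOperator(src_conn_id='my_s3', dest_conn_id='my_mysql')
```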

Could I propose that we plan to break absolutely all uses of `{named}_conn_id`s, before or by version 2.0, with an eye towards never again having to break a named parameter for the rest of 2.x's life? There are probably more named parameters that should be fixed per major release, and if you agree, some work should be done to identify them all; this one just happens to be the most likely to change often, and has so recently.

Thanks,
-Daniel


Re: conn_id breaking change; once more with feeling

Posted by Maxime Beauchemin <ma...@gmail.com>.
> Breaking changes could be maintained via deprecation warnings for a
> number of releases to avoid deterring users, whilst pushing towards a
> cleaner interface.

No one wants to go and alter hundreds of DAGs, thousands of operator calls.
I know for a fact that the task would be monumental at both Lyft and Airbnb
and could lead to upgrades never happening, or cause many incidents (change
causes incidents!). It also means a maintainer committing to changing all
operators to support both names, handling deprecation warnings, and cleaning
up eventually; I don't think you'll find someone to sign up for that task.
There are also a lot of operators in the wild (living in pipeline repos)
that may just never conform.

For operators specifically, as mentioned earlier in the thread, the reason
it is the way it is is so `default_args` can set all the conn_ids at the top
of the script. Personally, I feel strongly about not breaking this. It makes
DAG scripts look nice, and changing it would be hard and very disruptive
(every DAG in existence would have to adapt and would end up looking worse,
repeating conn_ids for each task instead of using a global setting).
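A minimal sketch of that pattern (the operator class and connection names are made up; in Airflow, `default_args` keys are forwarded to each task's constructor):

```python
# Sketch of the `default_args` pattern (connection names are made up).
# Prefixed names like `aws_conn_id` and `hive_cli_conn_id` can coexist
# in one dict at the top of the script, and each operator picks up only
# the keys it understands.

default_args = {
    'owner': 'data-eng',
    'aws_conn_id': 'aws_prod',        # picked up by S3/AWS-based tasks
    'hive_cli_conn_id': 'hive_prod',  # picked up by Hive tasks
}

class FakeS3Operator:
    # Stand-in operator: it keeps only the kwargs it understands.
    def __init__(self, task_id, aws_conn_id='aws_default', **_):
        self.task_id = task_id
        self.aws_conn_id = aws_conn_id

# Tasks inherit their connection without restating it per task:
task = FakeS3Operator(task_id='load', **default_args)
```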

As for hooks, why not just formalize the fact that the first argument of
the constructor is a `conn_id`, and use positional arguments when
instantiating hooks? There are generic operators out there that build upon
this assumption already.
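A sketch of that convention (the hook classes are stand-ins; only the first-argument-is-the-conn-id convention is the point):

```python
# Sketch: if every hook's first constructor argument is its connection
# id, generic code can instantiate any hook positionally without knowing
# the prefixed keyword name. The hook classes here are stand-ins.

class FakeMySqlHook:
    def __init__(self, mysql_conn_id='mysql_default'):
        self.conn_id = mysql_conn_id

class FakeS3Hook:
    def __init__(self, aws_conn_id='aws_default'):
        self.conn_id = aws_conn_id

def build_hook(hook_cls, conn_id):
    # Positional call sidesteps the differing keyword names entirely.
    return hook_cls(conn_id)

hooks = [build_hook(FakeMySqlHook, 'warehouse'), build_hook(FakeS3Hook, 'lake')]
```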

Max



On Fri, Jun 29, 2018 at 9:04 AM julianderuiter@gmail.com <
julianderuiter@gmail.com> wrote:

> [quoted text snipped]

Re: conn_id breaking change; once more with feeling

Posted by ju...@gmail.com, ju...@gmail.com.
I would be very much in favour of moving to the naming approach you propose (conn_id for hooks, src_conn_id and dest_conn_id for operators with multiple connections), which I think is much more consistent than the current naming conventions. An added advantage of this naming is that it makes it much easier in the future to move towards more generic operators/hooks where we (for example) copy from one database or file system to another without caring which systems are involved. This avoids the wildfire in the current Airflow codebase, where we end up with an operator for every different combination of file systems.

Breaking changes could be maintained via deprecation warnings for a number of releases to avoid deterring users, whilst pushing towards a cleaner interface.

On 2018/05/30 01:19:37, "Daniel (Daniel Lamblin) [BDP - Seoul]" <la...@coupang.com> wrote: 
> [quoted text snipped]

Re: conn_id breaking change; once more with feeling

Posted by Ash Berlin-Taylor <as...@firemirror.com>.
I was involved in the GitHub discussion about the rename to aws_conn_id, and it prompted me to write http://mail-archives.apache.org/mod_mbox/airflow-dev/201801.mbox/%3cCABYbY7dPS8X6Z4mgbahevQwF5BnYYHXezFo=avoLBNxPzp5=BA@mail.gmail.com%3e (well, the thing Chris is quoting; for some reason I can't find my original mail in the archives).

But I never got around to writing up the mentioned proposal. :(

Does anyone have any further thoughts on the discussion?


> On 30 May 2018, at 06:43, Maxime Beauchemin <ma...@gmail.com> wrote:
> [quoted text snipped]


Re: conn_id breaking change; once more with feeling

Posted by "Daniel (Daniel Lamblin) [BDP - Seoul]" <la...@coupang.com>.
Thanks Maxime & Ash, 
So, I'm glad to hear that minor releases should never invalidate DAGs, with rare exceptions being possible.

I like the idea that this is (or was going to be) written up on the wiki as a policy. In it, I suppose, defining what a minor version is would help (I think it's Major.Minor.Point, but I could be wrong). I would go further and remove the caveat about rare exceptions: ALL policies are implicitly able to be overridden when a rare exception is necessary and there's discussion and consensus about the need. These aren't bylaws overseen by a regulatory body. Or have I underestimated the Apache process? The wiki document should probably be among the first things to know in a contributor's guide, and the guide should be referenced from the main docs in the project section, perhaps near Roadmap. It may be worth explaining the promise, and the meaning of deprecation and experimental, to end users in the docs' FAQ too.

The specific aws/s3 change was documented in the updating.md file. However, as I mentioned, it still blocks a straightforward upgrade for us. Actually, my stance is that we can rename all references; it's not hard. But the safety-focused stance won (merging a branch with renames will break v1.8.2, so you can only merge after upgrading, but the upgrade will break until you merge it, and then what if you need to roll back to v1.8.2 for another reason? Plus, there's no window of more than 10 minutes when no DAGs need to be running, so…). [Aside: I was going to update workers, then the scheduler, then the webservers; but this would only have worked if the needed changes could work in both the old and new version and there were no DB changes; v1.8.2 to v1.9.0 had no DB changes, so I was pretty hopeful.]

As much as I agree, deprecation warnings should also follow the described release cadence. I found that sqlalchemy and the Flask framework emit a decent number of their own deprecation warnings, so if someone could put in some work (is it just config work?) to squash the warning noise that's aimed at developers of Airflow rather than users of Airflow, it would make it easier to follow this pattern.
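A sketch of one way to squash those, using the standard library's `warnings` filters (the module patterns are examples, not an exhaustive list):

```python
import warnings

def silence_framework_deprecations():
    # Hide deprecation chatter emitted from inside Airflow's dependencies
    # (module patterns are examples) while leaving warnings aimed at DAG
    # authors visible.
    for pattern in (r'sqlalchemy.*', r'flask.*'):
        warnings.filterwarnings('ignore', category=DeprecationWarning,
                                module=pattern)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')          # baseline: record everything
    silence_framework_deprecations()
    # Simulated warning raised from within sqlalchemy: filtered out.
    warnings.warn_explicit('old API', DeprecationWarning,
                           'orm.py', 1, module='sqlalchemy.orm')
    # Warning from user-facing code: still recorded.
    warnings.warn('check your DAG', DeprecationWarning)

messages = [str(w.message) for w in caught]
```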

I follow your explanation that putting the `prefixed_conn_ids` into `default_args` was the reason for the varied naming. That's valid for operators. Hooks, however, take only one connection and don't get `default_args` passed to them, so this case doesn't justify their needing a cornucopia of named parameters over just one. I do see where it's clearer if an operator passes the argument along without renaming it, but what I'm really getting at is the following:
If all hooks took their connection id via the same named parameter, then no change to a hook would ever again _need_ a discussion about renaming that parameter. And if hooks became free of this concern, is the convenience of centralized `default_args` worth operators requiring vigilance or discussion during updates? Probably not.

However: 

> And yes, minor releases should never invalidate DAGs except for rare
> exceptions (say, a new feature that is still pre-release, or one that never
> worked properly in a previous release for some reason). Breaking changes
> should come with an UPDATING.md message aligned with the release. Pretty
> names don't justify breaking stuff and forcing people to grep and
> mass-replace things. If someone wants a prettier name for an argument or
> anything else in the "obviously-public API" (DAG objects, operators,
> setting deps, ...), they should be required to make the change backward
> compatible and start logging.warning() about the deprecation coming in
> the next major release.

So, just to be clear, what you say here means no, let's not break the named parameters again. That's a fair argument, if this policy is in place.

But are you also saying: hey, we shouldn't have done that?
If not, then I'm still a little confused that this rename was fine once before but never again, even in the interest of longer-term stable naming. Whereas if you are saying that, then what processes keep it from happening? Should there be more people per review? A couple of high-level policy reviewers who have to chime in on each PR? A tool (linter) or extra test suite that checks for DAG-file compatibility? A more active vote on the RC?

I had a lot of difficulty trying out the RCs as they became available. The first few I tried weren't working as intended. Then I left the decision to try and comment on the later ones to others who had more time. Finally, I had assumed someone might hit that `s3_conn_id` change and say, "wait, that's supposed to work, let's fix it before we release," and if they hadn't, then there's probably an easy fix for everyone, like adding a plugin to v1.8.2 and changing the imports to airflow.shim.*, upgrading with the plugin in v1.9.0, and then, after code is migrated (if ever), removing the plugin.
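A heavily hedged sketch of that shim idea (the shim module, the stand-in hook, and the keyword mapping are all hypothetical; no such airflow.shim package exists):

```python
# Hypothetical shim (illustrative only; no airflow.shim package exists).
# Shipped as a plugin in both 1.8.2 and 1.9.0, it would re-export S3Hook
# under one import path and translate the renamed keyword, so DAGs could
# migrate once instead of in lock-step with the upgrade.

class _RealS3Hook:                  # stand-in for the installed S3Hook
    def __init__(self, aws_conn_id='aws_default'):
        self.aws_conn_id = aws_conn_id

def S3Hook(s3_conn_id=None, aws_conn_id=None, **kwargs):
    # Accept either spelling; prefer the new one when both are given.
    conn_id = aws_conn_id or s3_conn_id or 'aws_default'
    return _RealS3Hook(aws_conn_id=conn_id, **kwargs)

hook = S3Hook(s3_conn_id='my_s3')   # old-style call keeps working
```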

Anyway, thanks for intending to set out a policy on compatibility, for your thoughts on possible further change, and for the explanation of what the current naming style is used for.
-Daniel

On 5/30/18, 2:44 PM, "Maxime Beauchemin" <ma...@gmail.com> wrote:

    [quoted text snipped]


Re: conn_id breaking change; once more with feeling

Posted by Maxime Beauchemin <ma...@gmail.com>.
The main reason for the conn_id prefix is to facilitate the use of
`default_args`. Because of this you can set all your connections at the top
of your script and from that point on you just instantiate tasks without
re-stating connections. It's common for people to define multiple
"operating contexts" (production/staging/dev) where default_args and
connections are defined conditionally based on some env vars and having
different names for different conn_ids is key in that case.
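That operating-contexts pattern might look roughly like this (a sketch; the env var and connection names are made up):

```python
import os

# Sketch of the "operating contexts" pattern: connection ids resolve
# differently per environment, so the prefixed names in `default_args`
# must stay distinguishable. Env var and connection names are made up.

ENV = os.environ.get('DEPLOY_ENV', 'dev')   # production / staging / dev

default_args = {
    'owner': 'data-eng',
    'aws_conn_id': 'aws_%s' % ENV,       # e.g. aws_production vs aws_dev
    'hive_cli_conn_id': 'hive_%s' % ENV,
}
```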

Also as you mentioned many operators (all data transfers) take 2 conn_ids
which would require prefixing anyways.

And yes, minor releases should never invalidate DAGs except for rare
exceptions (say, a new feature that is still pre-release, or one that never
worked properly in a previous release for some reason). Breaking changes
should come with an UPDATING.md message aligned with the release. Pretty
names don't justify breaking stuff and forcing people to grep and
mass-replace things. If someone wants a prettier name for an argument or
anything else in the "obviously-public API" (DAG objects, operators,
setting deps, ...), they should be required to make the change backward
compatible and start logging.warning() about the deprecation coming in
the next major release.
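A sketch of that backward-compatible path (illustrative, not Airflow's actual code; it uses the `warnings` module, though `logging.warning()` as described would work the same way):

```python
import warnings

class S3Hook:
    # Illustrative backward-compatible rename: keep accepting the old
    # `s3_conn_id` keyword, warn, and map it to the new name until the
    # next major release. Not Airflow's actual code.
    def __init__(self, aws_conn_id='aws_default', s3_conn_id=None):
        if s3_conn_id is not None:
            warnings.warn(
                's3_conn_id is deprecated; use aws_conn_id '
                '(to be removed in the next major release)',
                DeprecationWarning, stacklevel=2)
            aws_conn_id = s3_conn_id
        self.aws_conn_id = aws_conn_id

hook = S3Hook(s3_conn_id='legacy')  # still works, with a warning
```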

I also think 2.0 should be as mild as possible on backward incompatible
changes or come with a compatibility layer (from airflow import LegacyDAG
as DAG). No one wants to go and mass update tons of scripts.

The same should hold for the less-obviously public API (hooks, methods
not prefixed with `_` or `__`), though breaking it may be reasonable in
some cases (say, a method that really should have been defined as
private); it should still be avoided as much as possible.

*Committers*, let's be vigilant about this. Breaking backward compatibility
is a big deal. An important part of code review is identifying backward
incompatible changes.

Max

On Tue, May 29, 2018 at 6:19 PM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamblin@coupang.com> wrote:

> [quoted text snipped]