Posted to dev@airflow.apache.org by Jeremiah Lowin <jl...@apache.org> on 2017/02/04 18:45:46 UTC

Contrib & Dataflow

Max made some great points on my dataflow PR, and I wanted to continue the
conversation here so that it is visible to all.

While I think my dataflow implementation contains the basic requirements
for any more complicated extension (but that conversation can wait!), I had
to implement it by adding some very specific "dataflow-only" code to core
Operator logic. In retrospect, that gives me pause (as, I believe, it did
for Max).

After thinking for a few days, what I really want to do is propose a very
small change to core Airflow: change BaseOperator.post_execute(context) to
BaseOperator.post_execute(result, context). I think the pre_execute and
post_execute hooks have generally been an afterthought, but with that
change (which, I think, is reasonable in and of itself) I could implement
dataflow entirely through those hooks.
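
To make that concrete, here is a rough sketch of what I have in mind (this
is not the PR itself; the storage backend and helper names are hypothetical,
only meant to show that a result-aware post_execute is enough of an
extension point):

    from airflow.models import BaseOperator

    class DataflowOperator(BaseOperator):
        """Hypothetical contrib operator that persists execute()'s return value."""

        def __init__(self, storage_backend, *args, **kwargs):
            super(DataflowOperator, self).__init__(*args, **kwargs)
            # storage_backend is a made-up pluggable object (local disk, S3, ...)
            self.storage_backend = storage_backend

        def pre_execute(self, context):
            # Load the persisted results of upstream tasks before running.
            self.upstream_data = {
                t.task_id: self.storage_backend.read(t.task_id, context)
                for t in self.upstream_list
            }

        def post_execute(self, result, context):
            # With the proposed signature, the value returned by execute()
            # arrives here and can be serialized without touching core logic.
            self.storage_backend.write(self.task_id, result, context)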

So that brings me to my next point: if the hook is changed, I could happily
drop a reworked dataflow implementation into contrib rather than core. That
would alleviate some of the pressure on Airflow to officially decide whether
it is the right implementation or not (it is! :) ). I feel like that would
be the optimal situation at the moment.

And that brings me to my final point: the future of "contrib" and the
Airflow community.
Having contrib in the core Airflow repo has some advantages:
  - standardized access
  - centralized repository for PRs
  - at least a style review (if not unit tests) from the committers
But some big disadvantages as well:
  - Very complicated dependency management [presumably, most contrib
operators need to add an extras_require entry for their specific
dependencies; see the sketch after this list]
  - No sense of ownership, and no easy way to raise issues (due to the
friction of opening JIRA tickets vs. GitHub issues)
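
For concreteness, the kind of extras_require entry I mean looks roughly
like this (the package name and version pins are illustrative, not
Airflow's actual setup.py):

    from setuptools import setup, find_packages

    setup(
        name='my-airflow-extension',  # hypothetical package name
        version='0.1.0',
        packages=find_packages(),
        # Each integration exposes its dependencies as an "extra", so users
        # opt in with e.g. `pip install my-airflow-extension[gcp]`.
        extras_require={
            'gcp': ['google-api-python-client>=1.5.0'],  # pins illustrative
            'docker': ['docker-py>=1.6.0'],
        },
    )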

One thought is to move the contrib directory to its own repo, which would
keep the advantages but remove the disadvantages from core Airflow. Another
is to encourage individual Airflow repos (Airflow-Docker, Airflow-Dataflow,
Airflow-YourExtensionHere) that could be installed a la carte. That would
leave maintenance up to the original author, but could lead to some
fracturing in the community as discovery becomes difficult.

Re: Contrib & Dataflow

Posted by Laura Lorenz <ll...@industrydive.com>.
Re: data storage and file reference metadata handled as part of the
post_execute hook
I'm interested to hear more about this idea, as I can't visualize how (or
if) it will implement multi-backend IO and either a standard or drop-in
serialization of result objects.
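
(To be concrete about what I mean by "drop-in" serialization, I am picturing
something along these lines; this is purely a hypothetical sketch, not
anything from the PR:)

    class ResultBackend(object):
        """Hypothetical interface a drop-in backend would need to cover;
        local disk, S3, GCS, etc. would each implement it."""

        def write(self, task_id, result, context):
            """Serialize `result` (the return value of execute()) and store it."""
            raise NotImplementedError

        def read(self, task_id, context):
            """Load and deserialize a previously stored result."""
            raise NotImplementedError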

I did just comment on the PR
<https://github.com/apache/incubator-airflow/pull/2046#issuecomment-278722340>
re: Max's comments since I wasn't totally sure where that conversation
should be had, but can move it over here if we want more visibility.

Re: breaking out repos
I know this has had some support for a while (from eavesdropping on this
list and on committer meeting reports), but I want to throw out some of the
gotchas we experienced. We first derived our own plugins using the Airflow
plugin system <https://airflow.incubator.apache.org/plugins.html>, and then,
when the plugin module discovery system became too unwieldy for us, packaged
some of our custom operators and hooks separately (fileflow
<https://www.github.com/industrydive/fileflow>). The latter case is closer
to what you are proposing, and there we had problems patching into the core
Airflow configuration management system
<https://github.com/industrydive/fileflow/pull/6/commits/9374b02444d4d9b69121c5605f67d48e22a031fa>.
This could have been just us (or fixed since Airflow 1.7.0, which is the
version we are still running), but it is a word of caution on things to
consider or redesign, given what we experienced packaging Airflow add-ons
separately.


Re: Contrib & Dataflow

Posted by Alex Van Boxel <al...@vanboxel.be>.
I like the idea. I already raised the issue of refactoring all the Google
Cloud operators together and, at the same time, making sure they are
consistent, so a separate repo would be a good fit here. And you could
manage your own dependencies. It would be great if the same thing happened
with the AWS operators.


-- 
  _/
_/ Alex Van Boxel