Posted to dev@airflow.apache.org by Kamil Olszewski <ka...@polidea.com> on 2020/09/01 11:35:28 UTC

Re: Generic Transfer Operator

Hello all,
since there have been no new comments shared in the POC doc
<https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
for a couple of days, then I will proceed with creating an AIP for this
feature, if that is ok with everybody.
Best regards,
Kamil
On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <tu...@apache.org>
wrote:

> I like the approach as it introduces another interesting operators'
> interface standardization. It would be awesome to hear more opinions :)
>
> Cheers,
> Tomek
>
> On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
> > I like the idea a lot. Similar things have been discussed before but the
> > proposal is I think rather pragmatic and solves a real problem (and it
> does
> > not seem to be too complex to implement)
> >
> > There is some discussion about it already in the document (please
> chime-in
> > for those interested) but here a few points why I like it:
> >
> > - performance and optimization is not a focus for that. For generic stuff
> > it is usually to write "optimal" solution but once you admit you are not
> > going to focus on optimisation, you come up with simpler and easier to use
> > solutions
> >
> > - on the other hand - it uses very "Python'y" approach with using
> > Airflow's familiar concepts (connection, transfer) and has the potential
> of
> > plugging in into 100s of hooks we have already easily - leveraging all
> the
> > "providers" richness of Airflow.
> >
> > - it aims to be easy to do "quick start" - if you have a number of
> > different sources/targets and as a data scientist you would like to
> quickly
> > start transferring data between them  - you can do it easily with only
> > basic python knowledge and simple DAG structure.
> >
> > - it should be possible to plug it in into our new functional approach as
> > well as future lineage discussions as it makes connection between sources
> > and targets
> >
> > - it opens up possibilities of adding simple and flexible data
> > transformation on-transfer. Not a replacement for any of the external
> > services that Airflow should use (Airflow is an orchestrator, not data
> > processing solution) but for the kind of quick-start scenarios I foresee
> it
> > might be most useful, being able to apply simple data transformation on
> the
> > fly by data scientist might be a big plus.
> >
> > Suggestion: Pandas DataFrame as the format of the "data" component
> >
> > Kamil - you should have access now.
> >
> > J.
> >
> >
> > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <
> > kamil.olszewski@polidea.com>
> > wrote:
> >
> > > Hello all,
> > > in Polidea we have come up with an idea for a generic transfer operator
> > > that would be able to transport data between two destinations of
> various
> > > types (file, database, storage, etc.) - please find the link with a
> short
> > > doc with POC
> > > <
> > >
> >
> https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
> > > >
> > > where we can discuss the design initially. Once we come to the initial
> > > conclusion I can create an AIP on cWiki - can I ask for permission to
> do
> > so
> > > (my id is 'kamil.olszewski')? I believe that during the discussion we
> > > should definitely aim for this feature to be released only after
> Airflow
> > > 2.0 is out.
> > >
> > > What do you think about this idea? Would you find such an operator
> > helpful
> > > in your pipelines? Maybe you already use a similar solution or know
> > > packages that could be used to implement it?
> > >
> > > Best regards,
> > > --
> > >
> > > Kamil Olszewski
> > > Polidea <https://www.polidea.com> | Software Engineer
> > >
> > > M: +48 503 361 783
> > > E: kamil.olszewski@polidea.com
> > >
> > > Unique Tech
> > > Check out our projects! <https://www.polidea.com/our-work>
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>


-- 

Kamil Olszewski
Polidea <https://www.polidea.com> | Software Engineer

M: +48 503 361 783
E: kamil.olszewski@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
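
To make the proposal above a little more concrete, here is a minimal sketch of what a generic transfer between a source and a destination might look like. Everything in it is an assumption made for illustration: the names `Source`, `Destination`, `InMemorySource`, `InMemoryDestination` and `transfer` are hypothetical and are not taken from the POC doc or from Airflow, and plain lists of dicts stand in for the Pandas DataFrame suggested in the thread so that the example needs only the standard library.

```python
from typing import Callable, Dict, Iterable, List, Optional, Protocol

Row = Dict[str, object]


class Source(Protocol):
    """Hypothetical interface: anything that can yield rows."""

    def read(self) -> Iterable[Row]: ...


class Destination(Protocol):
    """Hypothetical interface: anything that can accept rows."""

    def write(self, rows: Iterable[Row]) -> int: ...


class InMemorySource:
    """Toy source backed by a list; a real one might wrap an Airflow hook."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def read(self) -> Iterable[Row]:
        return iter(self._rows)


class InMemoryDestination:
    """Toy destination that collects rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> int:
        before = len(self.rows)
        self.rows.extend(rows)
        # Report how many rows were written by this call.
        return len(self.rows) - before


def transfer(
    source: Source,
    destination: Destination,
    transform: Optional[Callable[[Row], Row]] = None,
) -> int:
    """Move rows from source to destination, optionally transforming each row."""
    rows = source.read()
    if transform is not None:
        # Lazy, row-by-row transformation applied on the fly.
        rows = (transform(row) for row in rows)
    return destination.write(rows)
```

The optional `transform` argument corresponds to the simple on-transfer transformation mentioned in the quoted discussion; it is deliberately kept to a per-row function rather than a data processing engine.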

Re: Generic Transfer Operator

Posted by Jarek Potiuk <Ja...@polidea.com>.
+1. I'd also propose considering "both" rather than "vs". They do not
have to be implemented at the same time nor even by the same people.
Those could even be done in two AIPs and we could vote whether we implement
one, or both.

J.




On Sun, Sep 6, 2020 at 5:20 PM Tomasz Urbaszek <tu...@apache.org> wrote:

> Thanks, Ash for pointing to https://pypi.org/project/smart-open/ This
> one looks really interesting for blob storages transfer!
>
> As stated in the initial design doc I don't think we should focus on
> best performance but rather on versatility. Currently, we have many
> AtoB operators that do not yield the highest performance but do their
> work and are widely used.
>
> I would say that we should prepare an AIP that will propose two
> approaches: generic vs beam. This will allow us to compare them and
> then we can vote which one is better from the Airflow community
> perspective.
>
> What do you think?
>
> Tomek
>
>
> On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> > For background: in the past I had an S3 to S3 transfer using smartopen
> > (since we wanted to split one giant ~300GB file into smaller parts) and it
> > took about 10mins, so even "large" uses can work fine in Airflow - no JVM
> > required.
> >
> > -ash
> >
> > On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <tu...@apache.org>
> wrote:
> > >I think using direct runner as default with the option to specify
> > >other setup is a win-win. However, there are a few doubts I have about
> > >Beam based approach:
> > >
> > >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> > >will it install `apache-beam[gcp]`? What if there's a version clash
> > >between dependencies?
> > >
> > >2. The initial approach using `DataSource` concept allowed users to
> > >use it in any operator (not only transfer ones). In case of relying on
> > >Beam we are losing this.
> > >
> > >3. I'm not a Beam expert but it seems to not support any data lineage
> > >solution?
> > >
> > >
> > >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> > ><da...@gmail.com> wrote:
> > >>
> > >> I think there are absolutely use-cases for both. I’m totally fine
> > >> with saying “for small/medium use-cases, we come with an in-house
> > >> system. However, for larger cases, you’ll require Spark/Flink/S3.” That’s
> > >> totally in line with PLENTY of use-cases. This would be especially cool
> > >> when matched with fast-follow as we could EVEN potentially tie in data
> > >> locality.
> > >>
> > >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> > ><wh...@gmail.com> wrote:
> > >> I believe - for not large data - the direct runner is wholly doable,
> > >which
> > >> seems in line with airflow patterns. I have, and have spoken with
> > >several
> > >> others that have, been productive with that runner.
> > >>
> > >> For much larger transfers, the generic operator could accept
> > >parameters for
> > >> submitting the compute to an actual runner. Though, imagining that
> > >> (needing a runner) would not be the primary use case for such an
> > >operator.
> > >>
> > >>
> > >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <tu...@apache.org>
> > >wrote:
> > >>
> > >> > Austin, you are right, Beam covers all (and more) important IOs.
> > >> > However, using Apache Beam to design a generic transfer operator
> > >> > requires Airflow users to have additional resources that will be
> > >used
> > >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > >> > DirectRunner?
> > >> >
> > >> > Can you please tell us more how exactly you think we can use Beam
> > >for
> > >> > those Airflow transfer operators?
> > >> >
> > >> > Best,
> > >> > Tomek
> > >> >
> > >> >
> > >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > >> > <wh...@gmail.com> wrote:
> > >> > >
> > >> > > Are there IOs that would be desired for a generic transfer
> > >operator that
> > >> > > don't exist in:
> > >https://beam.apache.org/documentation/io/built-in/ <-
> > >> > > there is pretty solid coverage?
> > >> > >
> > >> > > Beam is getting to the point where even python beam can leverage
> > >the java
> > >> > > IOs, which increases the range of IOs (and performance).
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> > ><Ja...@polidea.com>
> > >> > > wrote:
> > >> > >
> > >> > > > But I believe those two ideas are separate ones as Tomek
> > >explained :)
> > >> > > >
> > >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> > ><Jarek.Potiuk@polidea.com
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > I love the idea of connecting the projects more closely!
> > >> > > > >
> > >> > > > > I've been helping recently as a consultant in improving the
> > >Apache
> > >> > Beam
> > >> > > > > build infrastructure (in many parts based on my Airflow
> > >experience
> > >> > and
> > >> > > > > Github Actions - even recently they adopted the "cancel"
> > >action I
> > >> > > > developed
> > >> > > > > for Apache Airflow).
> > >https://github.com/apache/beam/pull/12729
> > >> > > > >
> > >> > > > > Synergies in Apache projects are cool.
> > >> > > > >
> > >> > > > > J.
> > >> > > > >
> > >> > > > >
> > >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > >> > > > > <gc...@twitter.com.invalid> wrote:
> > >> > > > >
> > >> > > > >> Agree on keeping those separate, just intervened as I believe it's a
> > >> > > > >> great idea. But let's keep @beam and @spark to a separate thread.
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> Gerard Casas Saez
> > >> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> > >> > turbaszek@apache.org>
> > >> > > > >> wrote:
> > >> > > > >>
> > >> > > > >> > Daniel is right, we have a few Apache Beam committers in Polidea so we
> > >> > > > >> > will ask for advice. However, I would be highly in favor of having it
> > >> > > > >> > as Gerard suggested, as a @beam decorator. This is something we should
> > >> > > > >> > put into another AIP together with the mentioned @spark decorator.
> > >> > > > >> >
> > >> > > > >> > Our proposition of transfer operators was mainly to create
> > >> > something
> > >> > > > >> > Airflow-native that works out of the box and allows us to
> > >simplify
> > >> > > > >> > read/write from external sources. Thus, it requires no
> > >external
> > >> > > > >> > dependency other than the library to communicate with the
> > >API. In
> > >> > the
> > >> > > > >> > case of Beam we need more than that I think.
> > >> > > > >> >
> > >> > > > >> > Additionally, the ideas of Source and Destination play
> > >nicely with
> > >> > > > >> > data lineage and may bring more interest to this feature
> > >of
> > >> > Airflow.
> > >> > > > >> >
> > >> > > > >> > Cheers,
> > >> > > > >> > Tomek
> > >> > > > >> >
> > >> > > > >> >
> > >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> > ><ka...@gmail.com>
> > >> > > > wrote:
> > >> > > > >> > >
> > >> > > > >> > > Nice. Just a note here, we will need to make sure that those "Source"
> > >> > > > >> > > and "Destination" are serializable.
> > >> > > > >> > >
> > >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> > >> > > > daniel.imberman@gmail.com
> > >> > > > >> >
> > >> > > > >> > > wrote:
> > >> > > > >> > >
> > >> > > > >> > > > Interesting! Beam also could potentially allow transfers within
> > >> > > > >> > > > Dask/any other system with a java/python SDK? I think @jarek and
> > >> > > > >> > > > Polidea do a lot of work with Beam as well so I’d love their thoughts
> > >> > > > >> > > > on whether this is a good use-case.
> > >> > > > >> > > >
> > >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <
> > >> > > > >> > gcasassaez@twitter.com.invalid>
> > >> > > > >> > > > wrote:
> > >> > > > >> > > > I would be highly in favour of having a generic Beam
> > >operator.
> > >> > > > >> Similar
> > >> > > > >> > > > to @spark_task decorator. Something where you can
> > >easily
> > >> > define
> > >> > > > and
> > >> > > > >> > wrap a
> > >> > > > >> > > > beam pipeline and convert it to an Airflow operator.
> > >> > > > >> > > >
> > >> > > > >> > > > Gerard Casas Saez
> > >> > > > >> > > > Twitter | Cortex | @casassaez
> > ><http://twitter.com/casassaez>
> > >> > > > >> > > >
> > >> > > > >> > > >
> > >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> > >> > > > >> > > > whatwouldaustindo@gmail.com>
> > >> > > > >> > > > wrote:
> > >> > > > >> > > >
> > >> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not
> > >> > > > >> > > > > doing transforms, it might be rather straightforward to rely on the
> > >> > > > >> > > > > ecosystem of connectors in that Apache Project to use as the
> > >> > > > >> > > > > foundations for a generic transfer operator.
> > >> > > > >> > > > >
> > >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> > >> > > > >> > Jarek.Potiuk@polidea.com>
> > >> > > > >> > > > > wrote:
> > >> > > > >> > > > >
> > >> > > > >> > > > > > +1
> > >> > > > >> > > > > >
> > >> > > > >> > > > > > --
> > >> > > > >> > > > > >
> > >> > > > >> > > > > > Jarek Potiuk
> > >> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal
> > >Software
> > >> > > > >> Engineer
> > >> > > > >> > > > > >
> > >> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > >> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > >> > > > >> > > > > >
> > >> > > > >> > > > >
> > >> > > > >> >
> > >> > > > >> >
> > >> > > > >> >
> > >> > > > >> > --
> > >> > > > >> >
> > >> > > > >> > Tomasz Urbaszek
> > >> > > > >> > Polidea | Software Engineer
> > >> > > > >> >
> > >> > > > >> > M: +48 505 628 493
> > >> > > > >> > E: tomasz.urbaszek@polidea.com
> > >> > > > >> >
> > >> > > > >> > Unique Tech
> > >> > > > >> > Check out our projects!
> > >> > > > >> >
> > >> > > > >>
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > >
> > >> > > > > Jarek Potiuk
> > >> > > > > Polidea <https://www.polidea.com/> | Principal Software
> > >Engineer
> > >> > > > >
> > >> > > > > M: +48 660 796 129 <+48660796129>
> > >> > > > > [image: Polidea] <https://www.polidea.com/>
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > > > --
> > >> > > >
> > >> > > > Jarek Potiuk
> > >> > > > Polidea <https://www.polidea.com/> | Principal Software
> > >Engineer
> > >> > > >
> > >> > > > M: +48 660 796 129 <+48660796129>
> > >> > > > [image: Polidea] <https://www.polidea.com/>
> > >> > > >
> > >> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
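
Kaxil's note in the quoted thread above, that "Source" and "Destination" need to be serializable, could be addressed along the following lines. This is only a sketch under stated assumptions: `SourceSpec` and its fields (`conn_id`, `kind`, `options`) are hypothetical names invented for illustration, not part of the POC or of Airflow. The idea shown is simply to keep the description of a source as plain, JSON-friendly data (for example a connection id instead of a live client object), so it can be stored and rebuilt later.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical, declarative description of a source.

    It holds only strings and a dict of strings, never an open connection,
    so it can round-trip through JSON without custom encoders.
    """

    conn_id: str
    kind: str
    options: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # sort_keys makes the output stable, which helps when comparing payloads.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "SourceSpec":
        return cls(**json.loads(payload))
```

A destination could be described the same way; the actual client would then be created from the spec only at execution time.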

Re: Generic Transfer Operator

Posted by Austin Bennett <wh...@gmail.com>.
I would abstract Beam away from end-users.  The goal was minimal to no
transforms in the generic operator?  So, yes, would have a beam dependency,
but airflow-like interaction/wrapper (and then the ability to also extend
and generate desired transforms).

It sounds like dependency is a concern, which is reasonable and should at
least give pause.
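
One way to picture the "airflow-like wrapper" idea, together with Ash's suggestion quoted further down that the "same" Python API could serve both a core operator and a Beam-based one, is a small backend interface. The sketch below is an assumption-laden illustration: `TransferBackend`, `LocalFileBackend`, `BeamBackend` and `run_transfer` are hypothetical names, and the Beam backend is left as a placeholder on purpose so that no Beam calls are invented here.

```python
import shutil
from abc import ABC, abstractmethod
from typing import Optional


class TransferBackend(ABC):
    """Hypothetical common interface that callers program against."""

    @abstractmethod
    def copy(self, src: str, dst: str) -> None:
        """Copy the data at src to dst."""


class LocalFileBackend(TransferBackend):
    """Dependency-free backend: copies a local file with the standard library."""

    def copy(self, src: str, dst: str) -> None:
        shutil.copyfile(src, dst)


class BeamBackend(TransferBackend):
    """Placeholder for a backend that would build and run a Beam pipeline.

    Deliberately not implemented: it would need apache-beam installed,
    which is exactly the optional dependency discussed in this thread.
    """

    def copy(self, src: str, dst: str) -> None:
        raise NotImplementedError("requires apache-beam; omitted from this sketch")


def run_transfer(src: str, dst: str, backend: Optional[TransferBackend] = None) -> None:
    """Same call for every backend; defaults to the dependency-free one."""
    (backend or LocalFileBackend()).copy(src, dst)
```

With this shape, end users would call `run_transfer` the same way regardless of backend, and the Beam dependency would only matter to those who opt into that backend.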

On Fri, Sep 11, 2020 at 1:38 AM Kamil Olszewski <ka...@polidea.com>
wrote:

> I think the biggest downsides were already mentioned by Tomek: more
> dependency management when using apache-beam (plus possibility of conflicts
> between dependencies of beam and airflow) and no support for data lineage
> solutions. Besides, we create a higher entry threshold by creating a
> necessity to understand both beam and airflow concepts. That's why I am
> also in favor of considering both generic and beam approaches. Maybe we
> will be able to adapt some concepts for generic approach from beam without
> creating a direct dependency. If no one is against it, I will try to take a
> closer look at Beam concepts and create an AIP next week.
>
> Kamil
>
> On Mon, Sep 7, 2020 at 3:54 PM Daniel Imberman <da...@gmail.com>
> wrote:
>
> > Ok that’s awesome. I’m also seeing that they have an s3 IO setting [
> >
> https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.aws.s3io.html
> ]
> > . Seems that if it’s just a pip install we could start out with just File
> > (I imagine on kubernetes this could even work with volume mounts) and S3,
> > and then add more as time goes on? Are there any downsides with us tying
> > this into Beam? (e.g. if we want to use a storage system not yet
> supported
> > by beam).
> > On Sun, Sep 6, 2020 at 1:24 PM, Tomasz Urbaszek <tu...@apache.org>
> > wrote:
> > I checked it with our Beam team and DirectRunner is supported by
> > Python SDK and requires no JVM. That's the main reason I think it's
> > worth considering it :) Hard dependency on JVM would be probably a
> > no-go for us.
> > https://beam.apache.org/documentation/runners/direct/
> >
> > Tomek
> >
> >
> > On Sun, Sep 6, 2020 at 9:45 PM Daniel Imberman
> > <da...@gmail.com> wrote:
> > >
> > > Oof ok yeah. I hadn't realized that beam had a hard JVM requirement. I
> > > think that initially offering a local or block storage based solution
> > with
> > > easy extensions for users is totally in line with airflow philosophy. I
> > > > think that offering alternative transfer operators in providers is a
> > > > great idea!
> > >
> > > On Sun, Sep 6, 2020, 9:07 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> > >
> > > > No strong opinion - but it seems like generic is the easiest for us
> to
> > > > code (as we have most of it already via hooks?) and adopt (and
> doesn't
> > > > place a hard requirement on Beam/JVM, even if JVM would only be
> > runtime.
> > > > Still)
> > > >
> > > > This is possibly where Airflow has a core TransferOperator, and
> > > > > providers.apache.beam.operators.BeamTransferOperator? If the "same"
> > > > > Python API could be used for both, and it doesn't needlessly complicate
> > > > > things.
> > > >
> > > > -a
> > > >
> > > > On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <
> > turbaszek@apache.org>
> > > > wrote:
> > > > >Thanks, Ash for pointing to https://pypi.org/project/smart-open/
> This
> > > > >one looks really interesting for blob storages transfer!
> > > > >
> > > > >As stated in the initial design doc I don't think we should focus on
> > > > >best performance but rather on versatility. Currently, we have many
> > > > >AtoB operators that do not yield the highest performance but do
> their
> > > > >work and are widely used.
> > > > >
> > > > >I would say that we should prepare an AIP that will propose two
> > > > >approaches: generic vs beam. This will allow us to compare them and
> > > > >then we can vote which one is better from the Airflow community
> > > > >perspective.
> > > > >
> > > > >What do you think?
> > > > >
> > > > >Tomek
> > > > >
> > > > >
> > > > >On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org>
> > > > >wrote:
> > > > >>
> > > > >> For background: in the past I had an S3 to S3 transfer using
> > > > > >smartopen (since we wanted to split one giant ~300GB file into
> smaller
> > > > >parts) and it took about 10mins, so even "large" uses can work fine
> in
> > > > >Airflow - no JVM required.
> > > > >>
> > > > >> -ash
> > > > >>
> > > > >> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek
> > > > ><tu...@apache.org> wrote:
> > > > >> >I think using direct runner as default with the option to specify
> > > > > >> >other setup is a win-win. However, there are a few doubts I have
> > about
> > > > >> >Beam based approach:
> > > > >> >
> > > > >> >1. Dependency management. If I do `pip install
> apache-airflow[gcp]`
> > > > >> >will it install `apache-beam[gcp]`? What if there's a version
> clash
> > > > >> >between dependencies?
> > > > >> >
> > > > >> >2. The initial approach using `DataSource` concept allowed users
> to
> > > > >> >use it in any operator (not only transfer ones). In case of
> relying
> > > > >on
> > > > >> >Beam we are losing this.
> > > > >> >
> > > > >> >3. I'm not a Beam expert but it seems to not support any data
> > > > >lineage
> > > > >> >solution?
> > > > >> >
> > > > >> >
> > > > >> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> > > > >> ><da...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> I think there are absolutely use-cases for both. I’m totally fine
> > > > >> >> with saying “for small/medium use-cases, we come with an in-house
> > > > >> >> system. However, for larger cases, you’ll require Spark/Flink/S3.”
> > > > >> >> That’s totally in line with PLENTY of use-cases. This would be
> > > > >> >> especially cool when matched with fast-follow, as we could EVEN
> > > > >> >> potentially tie in data locality.
> > > > >> >>
> > > > >> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> > > > >> ><wh...@gmail.com> wrote:
> > > > >> >> I believe - for not large data - the direct runner is wholly doable,
> > > > >> >> which seems in line with Airflow patterns. I have, and have spoken with
> > > > >> >> several others that have, been productive with that runner.
> > > > >> >>
> > > > >> >> For much larger transfers, the generic operator could accept parameters
> > > > >> >> for submitting the compute to an actual runner. Though I imagine that
> > > > >> >> (needing a runner) would not be the primary use case for such an operator.
> > > > >> >>
> > > > >> >>
> > > > >> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek
> > > > ><tu...@apache.org>
> > > > >> >wrote:
> > > > >> >>
> > > > >> >> > Austin, you are right, Beam covers all (and more) important IOs.
> > > > >> >> > However, using Apache Beam to design a generic transfer operator
> > > > >> >> > requires Airflow users to have additional resources that will be used
> > > > >> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > > > >> >> > DirectRunner?
> > > > >> >> >
> > > > >> >> > Can you please tell us more about how exactly you think we can use Beam
> > > > >> >> > for those Airflow transfer operators?
> > > > >> >> >
> > > > >> >> > Best,
> > > > >> >> > Tomek
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > > > >> >> > <wh...@gmail.com> wrote:
> > > > >> >> > >
> > > > >> >> > > Are there IOs that would be desired for a generic transfer operator that
> > > > >> >> > > don't exist in: https://beam.apache.org/documentation/io/built-in/ <-
> > > > >> >> > > there is pretty solid coverage?
> > > > >> >> > >
> > > > >> >> > > Beam is getting to the point where even Python Beam can leverage the Java
> > > > >> >> > > IOs, which increases the range of IOs (and performance).
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> > > > >> ><Ja...@polidea.com>
> > > > >> >> > > wrote:
> > > > >> >> > >
> > > > >> >> > > > But I believe those two ideas are separate ones as Tomek
> > > > >> >explained :)
> > > > >> >> > > >
> > > > >> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> > > > >> ><Jarek.Potiuk@polidea.com
> > > > >> >> > >
> > > > >> >> > > > wrote:
> > > > >> >> > > >
> > > > >> >> > > > > I love the idea of connecting the projects more closely!
> > > > >> >> > > > >
> > > > >> >> > > > > I've been helping recently as a consultant in improving the Apache Beam
> > > > >> >> > > > > build infrastructure (in many parts based on my Airflow experience and
> > > > >> >> > > > > GitHub Actions - even recently they adopted the "cancel" action I
> > > > >> >> > > > > developed for Apache Airflow). https://github.com/apache/beam/pull/12729
> > > > >> >> > > > >
> > > > >> >> > > > > Synergies in Apache projects are cool.
> > > > >> >> > > > >
> > > > >> >> > > > > J.
> > > > >> >> > > > >
> > > > >> >> > > > >
> > > > >> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > > > >> >> > > > > <gc...@twitter.com.invalid> wrote:
> > > > >> >> > > > >
> > > > >> >> > > > >> Agree on keeping those separate, just intervened as I believe it's a great
> > > > >> >> > > > >> idea. But let's keep @beam and @spark to a separate thread.
> > > > >> >> > > > >>
> > > > >> >> > > > >>
> > > > >> >> > > > >> Gerard Casas Saez
> > > > >> >> > > > >> Twitter | Cortex | @casassaez
> > > > ><http://twitter.com/casassaez>
> > > > >> >> > > > >>
> > > > >> >> > > > >>
> > > > >> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> > > > >> >> > turbaszek@apache.org>
> > > > >> >> > > > >> wrote:
> > > > >> >> > > > >>
> > > > >> >> > > > >> > Daniel is right, we have a few Apache Beam committers in Polidea so we
> > > > >> >> > > > >> > will ask for advice. However, I would be highly in favor of having it
> > > > >> >> > > > >> > as Gerard suggested, as a @beam decorator. This is something we should
> > > > >> >> > > > >> > put into another AIP together with the mentioned @spark decorator.
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > Our proposition of transfer operators was mainly to create something
> > > > >> >> > > > >> > Airflow-native that works out of the box and allows us to simplify
> > > > >> >> > > > >> > read/write from external sources. Thus, it requires no external
> > > > >> >> > > > >> > dependency other than the library to communicate with the API. In the
> > > > >> >> > > > >> > case of Beam we need more than that, I think.
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > Additionally, the ideas of Source and Destination play nicely with
> > > > >> >> > > > >> > data lineage and may bring more interest to this feature of Airflow.
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > Cheers,
> > > > >> >> > > > >> > Tomek
> > > > >> >> > > > >> >
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> > > > >> ><ka...@gmail.com>
> > > > >> >> > > > wrote:
> > > > >> >> > > > >> > >
> > > > >> >> > > > >> > > Nice. Just a note here: we will need to make sure that "Source" and
> > > > >> >> > > > >> > > "Destination" are serializable.
> > > > >> >> > > > >> > >
> > > > >> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> > > > >> >> > > > daniel.imberman@gmail.com
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > > wrote:
> > > > >> >> > > > >> > >
> > > > >> >> > > > >> > > > Interesting! Beam also could potentially allow transfers within Dask/any
> > > > >> >> > > > >> > > > other system with a Java/Python SDK? I think @jarek and Polidea do a lot of
> > > > >> >> > > > >> > > > work with Beam as well, so I’d love their thoughts on whether this is a good
> > > > >> >> > > > >> > > > use-case.
> > > > >> >> > > > >> > > >
> > > > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas
> Saez
> > > > ><
> > > > >> >> > > > >> > gcasassaez@twitter.com.invalid>
> > > > >> >> > > > >> > > > wrote:
> > > > >> >> > > > >> > > > I would be highly in favour of having a generic Beam operator, similar
> > > > >> >> > > > >> > > > to the @spark_task decorator. Something where you can easily define and
> > > > >> >> > > > >> > > > wrap a Beam pipeline and convert it to an Airflow operator.
> > > > >> >> > > > >> > > >
> > > > >> >> > > > >> > > > Gerard Casas Saez
> > > > >> >> > > > >> > > > Twitter | Cortex | @casassaez
> > > > >> ><http://twitter.com/casassaez>
> > > > >> >> > > > >> > > >
> > > > >> >> > > > >> > > >
> > > > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> > > > >> >> > > > >> > > > whatwouldaustindo@gmail.com>
> > > > >> >> > > > >> > > > wrote:
> > > > >> >> > > > >> > > >
> > > > >> >> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not
> > > > >> >> > > > >> > > > > doing transforms, it might be rather straightforward to rely on the
> > > > >> >> > > > >> > > > > ecosystem of connectors in that Apache project to use as the foundations
> > > > >> >> > > > >> > > > > for a generic transfer operator.
> > > > >> >> > > > >> > > > >
> > > > >> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> > > > >> >> > > > >> > Jarek.Potiuk@polidea.com>
> > > > >> >> > > > >> > > > > wrote:
> > > > >> >> > > > >> > > > >
> > > > >> >> > > > >> > > > > > +1
> > > > >> >> > > > >> > > > > >
> > > > >> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil
> Olszewski
> > > > ><
> > > > >> >> > > > >> > > > > > kamil.olszewski@polidea.com>
> > > > >> >> > > > >> > > > > > wrote:
> > > > >> >> > > > >> > > > > >
> > > > >> >> > > > >> > > > > > > Hello all,
> > > > >> >> > > > >> > > > > > > since there have been no new comments shared in the POC doc
> > > > >> >> > > > >> > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> > > > >> >> > > > >> > > > > > > for a couple of days, I will proceed with creating an AIP for this
> > > > >> >> > > > >> > > > > > > feature, if that is ok with everybody.
> > > > >> >> > > > >> > > > > > > Best regards,
> > > > >> >> > > > >> > > > > > > Kamil
> > > > >> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz
> > > > >Urbaszek
> > > > >> ><
> > > > >> >> > > > >> > > > turbaszek@apache.org
> > > > >> >> > > > >> > > > > >
> > > > >> >> > > > >> > > > > > > wrote:
> > > > >> >> > > > >> > > > > > >
> > > > >> >> > > > >> > > > > > > > I like the approach as it introduces another interesting
> > > > >> >> > > > >> > > > > > > > standardization of the operators' interface. It would be awesome
> > > > >> >> > > > >> > > > > > > > to hear more opinions :)
> > > > >> >> > > > >> > > > > > > >
> > > > >> >> > > > >> > > > > > > > Cheers,
> > > > >> >> > > > >> > > > > > > > Tomek
> > > > >> >> > > > >> > > > > > > >
> > > > >> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek
> > > > >Potiuk <
> > > > >> >> > > > >> > > > > Jarek.Potiuk@polidea.com
> > > > >> >> > > > >> > > > > > >
> > > > >> >> > > > >> > > > > > > > wrote:
> > > > >> >> > > > >> > > > > > > >
> > > > >> >> > > > >> > > > > > > > > I like the idea a lot. Similar things have been discussed before but the
> > > > >> >> > > > >> > > > > > > > > proposal is I think rather pragmatic and solves a real problem (and it
> > > > >> >> > > > >> > > > > > > > > does not seem to be too complex to implement)
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > There is some discussion about it already in the document (please
> > > > >> >> > > > >> > > > > > > > > chime-in for those interested) but here a few points why I like it:
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > - performance and optimization is not a focus for that. For generic stuff
> > > > >> >> > > > >> > > > > > > > > it is usually to write "optimal" solution but once you admit you are not
> > > > >> >> > > > >> > > > > > > > > going to focus on optimisation, you come up with simpler and easier to use
> > > > >> >> > > > >> > > > > > > > > solutions
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > - on the other hand - it uses a very "Python'y" approach with using
> > > > >> >> > > > >> > > > > > > > > Airflow's familiar concepts (connection, transfer) and has the potential
> > > > >> >> > > > >> > > > > > > > > of plugging in into 100s of hooks we have already easily - leveraging all
> > > > >> >> > > > >> > > > > > > > > the "providers" richness of Airflow.
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > - it aims to be easy to do "quick start" - if you have a number of
> > > > >> >> > > > >> > > > > > > > > different sources/targets and as a data scientist you would like to
> > > > >> >> > > > >> > > > > > > > > quickly start transferring data between them - you can do it easily with
> > > > >> >> > > > >> > > > > > > > > only basic Python knowledge and a simple DAG structure.
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > - it should be possible to plug it in into our new functional approach as
> > > > >> >> > > > >> > > > > > > > > well as future lineage discussions as it makes a connection between
> > > > >> >> > > > >> > > > > > > > > sources and targets
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > - it opens up possibilities of adding simple and flexible data
> > > > >> >> > > > >> > > > > > > > > transformation on-transfer. Not a replacement for any of the external
> > > > >> >> > > > >> > > > > > > > > services that Airflow should use (Airflow is an orchestrator, not a data
> > > > >> >> > > > >> > > > > > > > > processing solution) but for the kind of quick-start scenarios I foresee
> > > > >> >> > > > >> > > > > > > > > it might be most useful, being able to apply simple data transformation on
> > > > >> >> > > > >> > > > > > > > > the fly by a data scientist might be a big plus.
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data" component
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > Kamil - you should have access now.
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > J.
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
> > > > >> >Olszewski <
> > > > >> >> > > > >> > > > > > > > > kamil.olszewski@polidea.com>
> > > > >> >> > > > >> > > > > > > > > wrote:
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > Hello all,
> > > > >> >> > > > >> > > > > > > > > > in Polidea we have come up with an idea for a generic transfer operator
> > > > >> >> > > > >> > > > > > > > > > that would be able to transport data between two destinations of various
> > > > >> >> > > > >> > > > > > > > > > types (file, database, storage, etc.) - please find the link with a short
> > > > >> >> > > > >> > > > > > > > > > doc with POC
> > > > >> >> > > > >> > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
> > > > >> >> > > > >> > > > > > > > > > where we can discuss the design initially. Once we come to the initial
> > > > >> >> > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki - can I ask for permission to do
> > > > >> >> > > > >> > > > > > > > > > so (my id is 'kamil.olszewski')? I believe that during the discussion we
> > > > >> >> > > > >> > > > > > > > > > should definitely aim for this feature to be released only after Airflow
> > > > >> >> > > > >> > > > > > > > > > 2.0 is out.
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > What do you think about this idea? Would you find such an operator
> > > > >> >> > > > >> > > > > > > > > > helpful in your pipelines? Maybe you already use a similar solution or
> > > > >> >> > > > >> > > > > > > > > > know packages that could be used to implement it?
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > Best regards,
> > > > >> >> > > > >> > > > > > > > > > --
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > Kamil Olszewski
> > > > >> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
> > > > >> >Software
> > > > >> >> > Engineer
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > M: +48 503 361 783
> > > > >> >> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > > Unique Tech
> > > > >> >> > > > >> > > > > > > > > > Check out our projects! <
> > > > >> >> > > > >> https://www.polidea.com/our-work>
> > > > >> >> > > > >> > > > > > > > > >
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > --
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > Jarek Potiuk
> > > > >> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
> > > > >> >Principal
> > > > >> >> > Software
> > > > >> >> > > > >> > Engineer
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > > > >> >> > > > >> > > > > > > > > [image: Polidea]
> > > > ><https://www.polidea.com/>
> > > > >> >> > > > >> > > > > > > > >
> > > > >> >> > > > >> > --
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > Tomasz Urbaszek
> > > > >> >> > > > >> > Polidea | Software Engineer
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > M: +48 505 628 493
> > > > >> >> > > > >> > E: tomasz.urbaszek@polidea.com
> > > > >> >> > > > >> >
> > > > >> >> > > > >> > Unique Tech
> > > > >> >> > > > >> > Check out our projects!

Re: Generic Transfer Operator

Posted by Kamil Olszewski <ka...@polidea.com>.
I think the biggest downsides were already mentioned by Tomek: more
dependency management when using apache-beam (plus the possibility of
conflicts between the dependencies of Beam and Airflow) and no support for
data lineage solutions. Besides, we raise the entry threshold by making it
necessary to understand both Beam and Airflow concepts. That's why I am
also in favor of considering both the generic and the Beam approaches. Maybe
we will be able to adapt some concepts from Beam for the generic approach
without creating a direct dependency. If no one is against it, I will try to
take a closer look at Beam concepts and create an AIP next week.
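
As a rough illustration of the generic approach (a sketch under
assumptions, not the POC code): the `DataSource` name comes from this
thread, but `DataDestination`, `LocalFileSource`, `LocalFileDestination`,
`transfer` and `demo` are hypothetical names made up for this example. It
uses only the standard library, so it says nothing about how real hooks
would be wired in.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Iterator, Optional


class DataSource(ABC):
    """Hypothetical interface: anything that can yield data in chunks."""

    @abstractmethod
    def read(self) -> Iterator[bytes]:
        ...


class DataDestination(ABC):
    """Hypothetical interface: anything that can consume chunks of data."""

    @abstractmethod
    def write(self, chunks: Iterator[bytes]) -> None:
        ...


@dataclass
class LocalFileSource(DataSource):
    # Plain dataclass fields keep the object easy to serialize.
    path: str
    chunk_size: int = 1024 * 1024

    def read(self) -> Iterator[bytes]:
        # Stream the file so that large inputs are never fully in memory.
        with open(self.path, "rb") as handle:
            while True:
                chunk = handle.read(self.chunk_size)
                if not chunk:
                    break
                yield chunk


@dataclass
class LocalFileDestination(DataDestination):
    path: str

    def write(self, chunks: Iterator[bytes]) -> None:
        with open(self.path, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)


def transfer(
    source: DataSource,
    destination: DataDestination,
    transform: Optional[Callable[[bytes], bytes]] = None,
) -> None:
    """Move data from source to destination, optionally transforming each chunk."""
    chunks = source.read()
    if transform is not None:
        chunks = map(transform, chunks)
    destination.write(chunks)


def demo() -> bytes:
    """Copy a small temporary file, upper-casing it on the way."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.txt"
        dst = Path(tmp) / "out.txt"
        src.write_bytes(b"hello airflow")
        transfer(
            LocalFileSource(str(src), chunk_size=4),
            LocalFileDestination(str(dst)),
            transform=bytes.upper,
        )
        return dst.read_bytes()


if __name__ == "__main__":
    print(demo())
    print(asdict(LocalFileSource("a.txt")))
```

The sources and destinations here are dataclasses with plain fields, so
`asdict` gives a dict that is easy to serialize, which is one possible way
to address the serialization concern raised earlier in the thread. The
operator only depends on the two small interfaces, so a Beam-backed
implementation could in principle sit behind the same `transfer` call.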

Kamil

On Mon, Sep 7, 2020 at 3:54 PM Daniel Imberman <da...@gmail.com>
wrote:

> Ok that’s awesome. I’m also seeing that they have an s3 IO setting [
> https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.aws.s3io.html]
> . Seems that if it’s just a pip install we could start out with just File
> (I imagine on kubernetes this could even work with volume mounts) and S3,
> and then add more as time goes on? Are there any downsides with us tying
> this into Beam? (e.g. if we want to use a storage system not yet supported
> by beam).
> On Sun, Sep 6, 2020 at 1:24 PM, Tomasz Urbaszek <tu...@apache.org>
> wrote:
> I checked it with our Beam team and DirectRunner is supported by the
> Python SDK and requires no JVM. That's the main reason I think it's
> worth considering :) A hard dependency on the JVM would probably be a
> no-go for us.
> https://beam.apache.org/documentation/runners/direct/
>
> Tomek
>
>
> On Sun, Sep 6, 2020 at 9:45 PM Daniel Imberman
> <da...@gmail.com> wrote:
> >
> > Oof ok yeah. I hadn't realized that Beam had a hard JVM requirement. I
> > think that initially offering a local or block storage based solution with
> > easy extensions for users is totally in line with Airflow philosophy. I
> > think that offering alternative transfer operators in providers is a great
> > idea!
> >
> > On Sun, Sep 6, 2020, 9:07 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> > > No strong opinion - but it seems like generic is the easiest for us to
> > > code (as we have most of it already via hooks?) and adopt (and doesn't
> > > place a hard requirement on Beam/JVM, even if JVM would only be needed at
> > > runtime. Still)
> > >
> > > This is possibly where Airflow has a core TransferOperator, and
> > > providers.apache.beam.operators.BeamTransferOperator? If the "same" Python
> > > API could be used for both, and it doesn't needlessly complicate things.
> > >
> > > -a
> > >
> > > On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <
> turbaszek@apache.org>
> > > wrote:
> > > >Thanks, Ash for pointing to https://pypi.org/project/smart-open/ This
> > > >one looks really interesting for blob storages transfer!
> > > >
> > > >As stated in the initial design doc I don't think we should focus on
> > > >best performance but rather on versatility. Currently, we have many
> > > >AtoB operators that do not yield the highest performance but do their
> > > >work and are widely used.
> > > >
> > > >I would say that we should prepare an AIP that will propose two
> > > >approaches: generic vs beam. This will allow us to compare them and
> > > >then we can vote which one is better from the Airflow community
> > > >perspective.
> > > >
> > > >What do you think?
> > > >
> > > >Tomek
> > > >
> > > >
> > > >On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org>
> > > >wrote:
> > > >>
> > > >> For background: in the past I had an S3 to S3 transfer using
> > > >> smart_open (since we wanted to split one giant ~300GB file into smaller
> > > >> parts) and it took about 10 mins, so even "large" uses can work fine in
> > > >> Airflow - no JVM required.
> > > >>
> > > >> -ash
> > > >>
> > > >> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek
> > > ><tu...@apache.org> wrote:
> > > >> >I think using direct runner as default with the option to specify
> > > >> >other setup is a win-win. However, there are a few doubts I have
> > > >> >about the Beam-based approach:
> > > >> >
> > > >> >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> > > >> >will it install `apache-beam[gcp]`? What if there's a version clash
> > > >> >between dependencies?
> > > >> >
> > > >> >2. The initial approach using the `DataSource` concept allowed users to
> > > >> >use it in any operator (not only transfer ones). If we rely on
> > > >> >Beam, we lose this.
> > > >> >
> > > >> >3. I'm not a Beam expert but it seems to not support any data lineage
> > > >> >solution?
> > > >> >
> > > >> >
> > > >> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> > > >> ><da...@gmail.com> wrote:
> > > >> >>
> > > >> >> I think there are absolutely use-cases for both. I’m totally fine
> > > >> >> with saying “for small/medium use-cases, we come with an in-house
> > > >> >> system. However, for larger cases, you’ll require Spark/Flink/S3.”
> > > >> >> That’s totally in line with PLENTY of use-cases. This would be
> > > >> >> especially cool when matched with fast-follow, as we could EVEN
> > > >> >> potentially tie in data locality.
> > > >> >>
> > > >> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> > > >> ><wh...@gmail.com> wrote:
> > > >> >> I believe - for not large data - the direct runner is wholly doable,
> > > >> >> which seems in line with Airflow patterns. I have, and have spoken with
> > > >> >> several others that have, been productive with that runner.
> > > >> >>
> > > >> >> For much larger transfers, the generic operator could accept parameters
> > > >> >> for submitting the compute to an actual runner. Though I imagine that
> > > >> >> (needing a runner) would not be the primary use case for such an operator.
> > > >> >>
> > > >> >>
> > > >> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek
> > > ><tu...@apache.org>
> > > >> >wrote:
> > > >> >>
> > > >> >> > Austin, you are right, Beam covers all (and more) important IOs.
> > > >> >> > However, using Apache Beam to design a generic transfer operator
> > > >> >> > requires Airflow users to have additional resources that will be used
> > > >> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > > >> >> > DirectRunner?
> > > >> >> >
> > > >> >> > Can you please tell us more about how exactly you think we can use Beam
> > > >> >> > for those Airflow transfer operators?
> > > >> >> >
> > > >> >> > Best,
> > > >> >> > Tomek
> > > >> >> >
> > > >> >> >
> > > >> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > > >> >> > <wh...@gmail.com> wrote:
> > > >> >> > >
> > > >> >> > > Are there IOs that would be desired for a generic transfer
> > > >> >operator that
> > > >> >> > > don't exist in:
> > > >> >https://beam.apache.org/documentation/io/built-in/ <-
> > > >> >> > > there is pretty solid coverage?
> > > >> >> > >
> > > >> >> > > Beam is getting to the point where even python beam can
> > > >leverage
> > > >> >the java
> > > >> >> > > IOs, which increases the range of IOs (and performance).
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> > > >> ><Ja...@polidea.com>
> > > >> >> > > wrote:
> > > >> >> > >
> > > >> >> > > > But I believe those two ideas are separate ones as Tomek
> > > >> >explained :)
> > > >> >> > > >
> > > >> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> > > >> ><Jarek.Potiuk@polidea.com
> > > >> >> > >
> > > >> >> > > > wrote:
> > > >> >> > > >
> > > >> >> > > > > I love the idea of connecting the projects more closely!
> > > >> >> > > > >
> > > >> >> > > > > I've been helping recently as a consultant in improving
> > > >the
> > > >> >Apache
> > > >> >> > Beam
> > > >> >> > > > > build infrastructure (in many parts based on my Airflow
> > > >> >experience
> > > >> >> > and
> > > >> >> > > > > Github Actions - even recently they adopted the "cancel"
> > > >> >action I
> > > >> >> > > > developed
> > > >> >> > > > > for Apache Airflow).
> > > >> >https://github.com/apache/beam/pull/12729
> > > >> >> > > > >
> > > >> >> > > > > Synergies in Apache projects are cool.
> > > >> >> > > > >
> > > >> >> > > > > J.
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > > >> >> > > > > <gc...@twitter.com.invalid> wrote:
> > > >> >> > > > >
> > > >> >> > > > >> Agree on keeping those separate, just intervened as I
> > > >> >believe it's a
> > > >> >> > > > great
> > > >> >> > > > >> idea. But let's keep @beam and @spark to a separate
> > > >thread.
> > > >> >> > > > >>
> > > >> >> > > > >>
> > > >> >> > > > >> Gerard Casas Saez
> > > >> >> > > > >> Twitter | Cortex | @casassaez
> > > ><http://twitter.com/casassaez>
> > > >> >> > > > >>
> > > >> >> > > > >>
> > > >> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> > > >> >> > turbaszek@apache.org>
> > > >> >> > > > >> wrote:
> > > >> >> > > > >>
> > > >> >> > > > >> > Daniel is right, we have a few Apache Beam committers in
> > > >> >Polidea so
> > > >> >> > we
> > > >> >> > > > >> > will ask for advice. However, I would be highly in
> > > >favor
> > > >> >of
> > > >> >> > having it
> > > >> >> > > > >> > as Gerard suggested as @beam decorator. This is
> > > >something
> > > >> >we
> > > >> >> > should
> > > >> >> > > > >> > put into another AIP together with the mentioned
> @spark
> > > >> >decorator.
> > > >> >> > > > >> >
> > > >> >> > > > >> > Our proposition of transfer operators was mainly to
> > > >create
> > > >> >> > something
> > > >> >> > > > >> > Airflow-native that works out of the box and allows us
> > > >to
> > > >> >simplify
> > > >> >> > > > >> > read/write from external sources. Thus, it requires no
> > > >> >external
> > > >> >> > > > >> > dependency other than the library to communicate with
> > > >the
> > > >> >API. In
> > > >> >> > the
> > > >> >> > > > >> > case of Beam we need more than that I think.
> > > >> >> > > > >> >
> > > >> >> > > > >> > Additionally, the ideas of Source and Destination play
> > > >> >nicely with
> > > >> >> > > > >> > data lineage and may bring more interest to this
> > > >feature
> > > >> >of
> > > >> >> > Airflow.
> > > >> >> > > > >> >
> > > >> >> > > > >> > Cheers,
> > > >> >> > > > >> > Tomek
> > > >> >> > > > >> >
> > > >> >> > > > >> >
> > > >> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> > > >> ><ka...@gmail.com>
> > > >> >> > > > wrote:
> > > >> >> > > > >> > >
> > > >> >> > > > >> > > Nice. Just a note here, we will need to make sure
> > > >that
> > > >> >those
> > > >> >> > > > "Source"
> > > >> >> > > > >> and
> > > >> >> > > > >> > > "Destination" are serializable.
> > > >> >> > > > >> > >
> > > >> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> > > >> >> > > > daniel.imberman@gmail.com
> > > >> >> > > > >> >
> > > >> >> > > > >> > > wrote:
> > > >> >> > > > >> > >
> > > >> >> > > > >> > > > Interesting! Beam also could potentially allow
> > > >> >transfers
> > > >> >> > within
> > > >> >> > > > >> > Dask/any
> > > >> >> > > > >> > > > other system with a java/python SDK? I think
> @jarek
> > > >> >and
> > > >> >> > Polidea
> > > >> >> > > > do a
> > > >> >> > > > >> > lot of
> > > >> >> > > > >> > > > work with Beam as well so I’d love their thoughts
> > > >if
> > > >> >this is a
> > > >> >> > good
> > > >> >> > > > >> > use-case.
> > > >> >> > > > >> > > >
> > > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez
> > > ><
> > > >> >> > > > >> > gcasassaez@twitter.com.invalid>
> > > >> >> > > > >> > > > wrote:
> > > >> >> > > > >> > > > I would be highly in favour of having a generic
> > > >Beam
> > > >> >operator.
> > > >> >> > > > >> Similar
> > > >> >> > > > >> > > > to @spark_task decorator. Something where you can
> > > >> >easily
> > > >> >> > define
> > > >> >> > > > and
> > > >> >> > > > >> > wrap a
> > > >> >> > > > >> > > > beam pipeline and convert it to an Airflow
> > > >operator.
> > > >> >> > > > >> > > >
> > > >> >> > > > >> > > > Gerard Casas Saez
> > > >> >> > > > >> > > > Twitter | Cortex | @casassaez
> > > >> ><http://twitter.com/casassaez>
> > > >> >> > > > >> > > >
> > > >> >> > > > >> > > >
> > > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> > > >> >> > > > >> > > > whatwouldaustindo@gmail.com>
> > > >> >> > > > >> > > > wrote:
> > > >> >> > > > >> > > >
> > > >> >> > > > >> > > > > Are you guys familiar with Beam
> > > >> ><https://beam.apache.org>?
> > > >> >> > Esp.
> > > >> >> > > > >> if
> > > >> >> > > > >> > not
> > > >> >> > > > >> > > > > doing transforms, it might be rather
> straightforward
> > > >to
> > > >> >rely
> > > >> >> > on the
> > > >> >> > > > >> > > > ecosystem
> > > >> >> > > > >> > > > > of connectors in that Apache Project to use as
> > > >the
> > > >> >> > foundations
> > > >> >> > > > >> for a
> > > >> >> > > > >> > > > > generic transfer operator.
> > > >> >> > > > >> > > > >
> > > >> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> > > >> >> > > > >> > Jarek.Potiuk@polidea.com>
> > > >> >> > > > >> > > > > wrote:
> > > >> >> > > > >> > > > >
> > > >> >> > > > >> > > > > > +1
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski
> > > ><
> > > >> >> > > > >> > > > > > kamil.olszewski@polidea.com>
> > > >> >> > > > >> > > > > > wrote:
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > > Hello all,
> > > >> >> > > > >> > > > > > > since there have been no new comments shared
> > > >in
> > > >> >the POC
> > > >> >> > doc
> > > >> >> > > > >> > > > > > > <
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > >
> > > >> >> > > > >> > > >
> > > >> >> > > > >> >
> > > >> >> > > > >>
> > > >> >> > > >
> > > >> >> >
> > > >>
> > > >>
> > >
> https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > > for a couple of days, then I will proceed
> > > >with
> > > >> >creating
> > > >> >> > an
> > > >> >> > > > AIP
> > > >> >> > > > >> > for
> > > >> >> > > > >> > > > this
> > > >> >> > > > >> > > > > > > feature, if that is ok with everybody.
> > > >> >> > > > >> > > > > > > Best regards,
> > > >> >> > > > >> > > > > > > Kamil
> > > >> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz
> > > >Urbaszek
> > > >> ><
> > > >> >> > > > >> > > > turbaszek@apache.org
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > > wrote:
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > > I like the approach as it introduces
> > > >another
> > > >> >> > interesting
> > > >> >> > > > >> > operators'
> > > >> >> > > > >> > > > > > > > interface standardization. It would be
> > > >awesome
> > > >> >to hear
> > > >> >> > more
> > > >> >> > > > >> > opinions
> > > >> >> > > > >> > > > > :)
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > > > Cheers,
> > > >> >> > > > >> > > > > > > > Tomek
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek
> > > >Potiuk <
> > > >> >> > > > >> > > > > Jarek.Potiuk@polidea.com
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > > wrote:
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > > > > I like the idea a lot. Similar things
> > > >have
> > > >> >been
> > > >> >> > > > discussed
> > > >> >> > > > >> > before
> > > >> >> > > > >> > > > > but
> > > >> >> > > > >> > > > > > > the
> > > >> >> > > > >> > > > > > > > > proposal is I think rather pragmatic and
> > > >> >solves a
> > > >> >> > real
> > > >> >> > > > >> > problem
> > > >> >> > > > >> > > > (and
> > > >> >> > > > >> > > > > > it
> > > >> >> > > > >> > > > > > > > does
> > > >> >> > > > >> > > > > > > > > not seem to be too complex to implement)
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > There is some discussion about it
> already
> > > >in
> > > >> >the
> > > >> >> > > > document
> > > >> >> > > > >> > (please
> > > >> >> > > > >> > > > > > > > chime-in
> > > >> >> > > > >> > > > > > > > > for those interested) but here a few
> > > >points
> > > >> >why I
> > > >> >> > like
> > > >> >> > > > it:
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > - performance and optimization is not a
> > > >> >focus for
> > > >> >> > that.
> > > >> >> > > > >> For
> > > >> >> > > > >> > > > generic
> > > >> >> > > > >> > > > > > > stuff
> > > >> >> > > > >> > > > > > > > > it is usually to write "optimal"
> solution
> > > >> >but once
> > > >> >> > you
> > > >> >> > > > >> admit
> > > >> >> > > > >> > you
> > > >> >> > > > >> > > > > are
> > > >> >> > > > >> > > > > > > not
> > > >> >> > > > >> > > > > > > > > going to focus for optimisation, you
> come
> > > >> >with
> > > >> >> > simpler
> > > >> >> > > > and
> > > >> >> > > > >> > easier
> > > >> >> > > > >> > > > > to
> > > >> >> > > > >> > > > > > > use
> > > >> >> > > > >> > > > > > > > > solutions
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > - on the other hand - it uses very
> > > >> >"Python'y"
> > > >> >> > approach
> > > >> >> > > > >> with
> > > >> >> > > > >> > using
> > > >> >> > > > >> > > > > > > > > Airflow's familiar concepts (connection,
> > > >> >transfer)
> > > >> >> > and
> > > >> >> > > > has
> > > >> >> > > > >> > the
> > > >> >> > > > >> > > > > > > potential
> > > >> >> > > > >> > > > > > > > of
> > > >> >> > > > >> > > > > > > > > plugging in into 100s of hooks we have
> > > >> >already
> > > >> >> > easily -
> > > >> >> > > > >> > > > leveraging
> > > >> >> > > > >> > > > > > all
> > > >> >> > > > >> > > > > > > > the
> > > >> >> > > > >> > > > > > > > > "providers" richness of Airflow.
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > - it aims to be easy to do "quick start"
> > > >-
> > > >> >if you
> > > >> >> > have a
> > > >> >> > > > >> > number
> > > >> >> > > > >> > > > of
> > > >> >> > > > >> > > > > > > > > different sources/targets and as a data
> > > >> >scientist
> > > >> >> > you
> > > >> >> > > > >> would
> > > >> >> > > > >> > like
> > > >> >> > > > >> > > > to
> > > >> >> > > > >> > > > > > > > quickly
> > > >> >> > > > >> > > > > > > > > start transferring data between them -
> > > >you
> > > >> >can do it
> > > >> >> > > > >> easily
> > > >> >> > > > >> > with
> > > >> >> > > > >> > > > > > only
> > > >> >> > > > >> > > > > > > > > basic python knowledge and simple DAG
> > > >> >structure.
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > - it should be possible to plug it in
> > > >into
> > > >> >our new
> > > >> >> > > > >> functional
> > > >> >> > > > >> > > > > > approach
> > > >> >> > > > >> > > > > > > as
> > > >> >> > > > >> > > > > > > > > well as future lineage discussions as it
> > > >> >makes
> > > >> >> > > > connection
> > > >> >> > > > >> > between
> > > >> >> > > > >> > > > > > > sources
> > > >> >> > > > >> > > > > > > > > and targets
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > - it opens up possibilities of adding
> > > >simple
> > > >> >and
> > > >> >> > > > flexible
> > > >> >> > > > >> > data
> > > >> >> > > > >> > > > > > > > > transformation on-transfer. Not a
> > > >> >replacement for
> > > >> >> > any of
> > > >> >> > > > >> the
> > > >> >> > > > >> > > > > external
> > > >> >> > > > >> > > > > > > > > services that Airflow should use
> (Airflow
> > > >is
> > > >> >an
> > > >> >> > > > >> > orchestrator, not
> > > >> >> > > > >> > > > > > data
> > > >> >> > > > >> > > > > > > > > processing solution) but for the kind of
> > > >> >quick-start
> > > >> >> > > > >> > scenarios I
> > > >> >> > > > >> > > > > > > foresee
> > > >> >> > > > >> > > > > > > > it
> > > >> >> > > > >> > > > > > > > > might be most useful, being able to
> apply
> > > >> >simple
> > > >> >> > data
> > > >> >> > > > >> > > > > transformation
> > > >> >> > > > >> > > > > > on
> > > >> >> > > > >> > > > > > > > the
> > > >> >> > > > >> > > > > > > > > fly by data scientist might be a big
> > > >plus.
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > Suggestion: Panda DataFrame as the
> format
> > > >of
> > > >> >the
> > > >> >> > "data"
> > > >> >> > > > >> > component
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > Kamil - you should have access now.
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > J.
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
> > > >> >Olszewski <
> > > >> >> > > > >> > > > > > > > > kamil.olszewski@polidea.com>
> > > >> >> > > > >> > > > > > > > > wrote:
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > Hello all,
> > > >> >> > > > >> > > > > > > > > > in Polidea we have come up with an
> idea
> > > >> >for a
> > > >> >> > generic
> > > >> >> > > > >> > transfer
> > > >> >> > > > >> > > > > > > operator
> > > >> >> > > > >> > > > > > > > > > that would be able to transport data
> > > >> >between two
> > > >> >> > > > >> > destinations
> > > >> >> > > > >> > > > of
> > > >> >> > > > >> > > > > > > > various
> > > >> >> > > > >> > > > > > > > > > types (file, database, storage, etc.)
> -
> > > >> >please
> > > >> >> > find
> > > >> >> > > > the
> > > >> >> > > > >> > link
> > > >> >> > > > >> > > > > with a
> > > >> >> > > > >> > > > > > > > short
> > > >> >> > > > >> > > > > > > > > > doc with POC
> > > >> >> > > > >> > > > > > > > > > <
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > >
> > > >> >> > > > >> > > >
> > > >> >> > > > >> >
> > > >> >> > > > >>
> > > >> >> > > >
> > > >> >> >
> > > >>
> > > >>
> > >
> https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
> > > >> >> > > > >> > > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > where we can discuss the design
> > > >initially.
> > > >> >Once we
> > > >> >> > > > come
> > > >> >> > > > >> to
> > > >> >> > > > >> > the
> > > >> >> > > > >> > > > > > > initial
> > > >> >> > > > >> > > > > > > > > > conclusion I can create an AIP on
> cWiki
> > > >-
> > > >> >can I
> > > >> >> > ask
> > > >> >> > > > for
> > > >> >> > > > >> > > > > permission
> > > >> >> > > > >> > > > > > to
> > > >> >> > > > >> > > > > > > > do
> > > >> >> > > > >> > > > > > > > > so
> > > >> >> > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I
> believe
> > > >> >that
> > > >> >> > during
> > > >> >> > > > the
> > > >> >> > > > >> > > > > discussion
> > > >> >> > > > >> > > > > > we
> > > >> >> > > > >> > > > > > > > > > should definitely aim for this feature
> > > >to
> > > >> >be
> > > >> >> > released
> > > >> >> > > > >> only
> > > >> >> > > > >> > > > after
> > > >> >> > > > >> > > > > > > > Airflow
> > > >> >> > > > >> > > > > > > > > > 2.0 is out.
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > What do you think about this idea?
> > > >Would
> > > >> >you find
> > > >> >> > such
> > > >> >> > > > >> an
> > > >> >> > > > >> > > > > operator
> > > >> >> > > > >> > > > > > > > > helpful
> > > >> >> > > > >> > > > > > > > > > in your pipelines? Maybe you already
> > > >use a
> > > >> >similar
> > > >> >> > > > >> > solution or
> > > >> >> > > > >> > > > > know
> > > >> >> > > > >> > > > > > > > > > packages that could be used to
> > > >implement
> > > >> >it?
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > Best regards,
> > > >> >> > > > >> > > > > > > > > > --
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > Kamil Olszewski
> > > >> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
> > > >> >Software
> > > >> >> > Engineer
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > M: +48 503 361 783
> > > >> >> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > > > Unique Tech
> > > >> >> > > > >> > > > > > > > > > Check out our projects! <
> > > >> >> > > > >> https://www.polidea.com/our-work>
> > > >> >> > > > >> > > > > > > > > >
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > --
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > Jarek Potiuk
> > > >> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
> > > >> >Principal
> > > >> >> > Software
> > > >> >> > > > >> > Engineer
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > > >> >> > > > >> > > > > > > > > [image: Polidea]
> > > ><https://www.polidea.com/>
> > > >> >> > > > >> > > > > > > > >
> > > >> >> > > > >> > > > > > > >
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > --
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > Kamil Olszewski
> > > >> >> > > > >> > > > > > > Polidea <https://www.polidea.com> |
> Software
> > > >> >Engineer
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > M: +48 503 361 783
> > > >> >> > > > >> > > > > > > E: kamil.olszewski@polidea.com
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > > > Unique Tech
> > > >> >> > > > >> > > > > > > Check out our projects! <
> > > >> >> > https://www.polidea.com/our-work>
> > > >> >> > > > >> > > > > > >
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > --
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > Jarek Potiuk
> > > >> >> > > > >> > > > > > Polidea <https://www.polidea.com/> |
> Principal
> > > >> >Software
> > > >> >> > > > >> Engineer
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > > >> >> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > > >> >> > > > >> > > > > >
> > > >> >> > > > >> > > > >
> > > >> >> > > > >> >
> > > >> >> > > > >> >
> > > >> >> > > > >> >
> > > >> >> > > > >> > --
> > > >> >> > > > >> >
> > > >> >> > > > >> > Tomasz Urbaszek
> > > >> >> > > > >> > Polidea | Software Engineer
> > > >> >> > > > >> >
> > > >> >> > > > >> > M: +48 505 628 493
> > > >> >> > > > >> > E: tomasz.urbaszek@polidea.com
> > > >> >> > > > >> >
> > > >> >> > > > >> > Unique Tech
> > > >> >> > > > >> > Check out our projects!
> > > >> >> > > > >> >
> > > >> >> > > > >>
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > > --
> > > >> >> > > > >
> > > >> >> > > > > Jarek Potiuk
> > > >> >> > > > > Polidea <https://www.polidea.com/> | Principal Software
> > > >> >Engineer
> > > >> >> > > > >
> > > >> >> > > > > M: +48 660 796 129 <+48660796129>
> > > >> >> > > > > [image: Polidea] <https://www.polidea.com/>
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > > > --
> > > >> >> > > >
> > > >> >> > > > Jarek Potiuk
> > > >> >> > > > Polidea <https://www.polidea.com/> | Principal Software
> > > >> >Engineer
> > > >> >> > > >
> > > >> >> > > > M: +48 660 796 129 <+48660796129>
> > > >> >> > > > [image: Polidea] <https://www.polidea.com/>
> > > >> >> > > >
> > > >> >> >
> > >
>
>
>
> --
>
> Tomasz Urbaszek
> Polidea | Software Engineer
>
> M: +48 505 628 493
> E: tomasz.urbaszek@polidea.com
>
> Unique Tech
> Check out our projects!



-- 

Kamil Olszewski
Polidea <https://www.polidea.com> | Software Engineer

M: +48 503 361 783
E: kamil.olszewski@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Generic Transfer Operator

Posted by Daniel Imberman <da...@gmail.com>.
OK, that’s awesome. I also see that they have an S3 IO module [https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.aws.s3io.html]. It seems that, if it’s just a pip install, we could start out with just File (I imagine on Kubernetes this could even work with volume mounts) and S3, and then add more as time goes on? Are there any downsides to tying this into Beam (e.g. if we want to use a storage system not yet supported by Beam)?
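
One way to picture the "start with File and S3, then add more as time goes on" idea is a small registry of storage backends keyed by URL scheme. The sketch below is only an illustration under stated assumptions: register_backend, transfer and the file:// handlers are hypothetical names, not part of Airflow or Beam, and only the local-file backend is implemented. An S3 backend, or one for a store that Beam does not cover, would presumably be registered the same way.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> (reader, writer) pair.
Reader = Callable[[str], bytes]
Writer = Callable[[str, bytes], None]
_BACKENDS: Dict[str, Tuple[Reader, Writer]] = {}


def register_backend(scheme: str, reader: Reader, writer: Writer) -> None:
    """Make a storage backend available under the given URL scheme."""
    _BACKENDS[scheme] = (reader, writer)


def _lookup(url: str) -> Tuple[Reader, Writer]:
    """Find the backend for a URL, or fail with a clear message."""
    scheme = urlparse(url).scheme
    if scheme not in _BACKENDS:
        raise ValueError(f"No backend registered for scheme {scheme!r}")
    return _BACKENDS[scheme]


def _read_file(url: str) -> bytes:
    with open(urlparse(url).path, "rb") as handle:
        return handle.read()


def _write_file(url: str, data: bytes) -> None:
    with open(urlparse(url).path, "wb") as handle:
        handle.write(data)


# The only backend shipped in this sketch; others can be added later.
register_backend("file", _read_file, _write_file)


def transfer(source_url: str, destination_url: str) -> int:
    """Copy the whole payload in memory and return the number of bytes moved."""
    reader, _ = _lookup(source_url)
    _, writer = _lookup(destination_url)
    data = reader(source_url)
    writer(destination_url, data)
    return len(data)
```

With this shape, a storage system that the underlying library does not support yet is a matter of one more register_backend call rather than a change to the transfer logic. Reading everything into memory is a deliberate simplification, in line with the thread's point that performance is not the first goal.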
On Sun, Sep 6, 2020 at 1:24 PM, Tomasz Urbaszek <tu...@apache.org> wrote:
I checked it with our Beam team and DirectRunner is supported by the
Python SDK and requires no JVM. That's the main reason I think it's
worth considering :) A hard dependency on the JVM would probably be a
no-go for us.
https://beam.apache.org/documentation/runners/direct/

Tomek


On Sun, Sep 6, 2020 at 9:45 PM Daniel Imberman
<da...@gmail.com> wrote:
>
> Oof ok yeah. I hadn't realized that beam had a hard JVM requirement. I
> think that initially offering a local or block storage based solution with
> easy extensions for users is totally in line with airflow philosophy. I
> think that offering alternative transfer operators in providers is a great
> idea!
>
> On Sun, Sep 6, 2020, 9:07 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> > No strong opinion - but it seems like generic is the easiest for us to
> > code (as we have most of it already via hooks?) and adopt (and doesn't
> > place a hard requirement on Beam/JVM, even if JVM would only be runtime.
> > Still)
> >
> > This is possibly where Airflow has a core TransferOperator, and
> > providers.apache.beam.operators.BeamTransferOperator? If the "same" python
> > API could be used for both, and it doesn't needlessly complicate things.
> >
> > -a
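
The "same Python API for both" idea could look roughly like the sketch below. Everything in it is hypothetical: BaseTransfer, HookStyleTransfer and BatchedTransfer are illustrative names, the in-memory source and destination stand in for real hooks, and BatchedTransfer is a plain-Python stand-in for an alternative engine such as a Beam-backed provider operator. It does not call Beam.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class InMemorySource:
    """Hypothetical source; a real one would wrap an Airflow hook."""
    records: List[str]

    def read(self) -> Iterator[str]:
        return iter(self.records)


@dataclass
class InMemoryDestination:
    """Hypothetical destination that just collects rows in a list."""
    records: List[str] = field(default_factory=list)

    def write(self, rows: List[str]) -> None:
        self.records.extend(rows)


class BaseTransfer(ABC):
    """Shared interface: every engine takes a source and a destination."""

    def __init__(self, source, destination):
        self.source = source
        self.destination = destination

    @abstractmethod
    def execute(self) -> int:
        """Move all rows and return how many were written."""


class HookStyleTransfer(BaseTransfer):
    """Simplest engine: read everything, then write everything."""

    def execute(self) -> int:
        rows = list(self.source.read())
        self.destination.write(rows)
        return len(rows)


class BatchedTransfer(BaseTransfer):
    """Alternative engine behind the same interface (writes in batches)."""

    def __init__(self, source, destination, batch_size: int = 2):
        super().__init__(source, destination)
        self.batch_size = batch_size

    def execute(self) -> int:
        written = 0
        batch: List[str] = []
        for row in self.source.read():
            batch.append(row)
            if len(batch) == self.batch_size:
                self.destination.write(batch)
                written += len(batch)
                batch = []
        if batch:  # flush the final partial batch
            self.destination.write(batch)
            written += len(batch)
        return written
```

A DAG author would only see the constructor and execute(), so swapping one engine for the other would not change the calling code.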
> >
> > On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <tu...@apache.org>
> > wrote:
> > >Thanks, Ash, for pointing to https://pypi.org/project/smart-open/ This
> > >one looks really interesting for blob storage transfers!
> > >
> > >As stated in the initial design doc I don't think we should focus on
> > >best performance but rather on versatility. Currently, we have many
> > >AtoB operators that do not yield the highest performance but do their
> > >work and are widely used.
> > >
> > >I would say that we should prepare an AIP that will propose two
> > >approaches: generic vs beam. This will allow us to compare them and
> > >then we can vote which one is better from the Airflow community
> > >perspective.
> > >
> > >What do you think?
> > >
> > >Tomek
> > >
> > >
> > >On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org>
> > >wrote:
> > >>
> > >> For background: in the past I had an S3 to S3 transfer using
> > >smart_open (since we wanted to split one giant ~300GB file into smaller
> > >parts) and it took about 10 mins, so even "large" uses can work fine in
> > >Airflow - no JVM required.
> > >>
> > >> -ash
> > >>
> > >> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek
> > ><tu...@apache.org> wrote:
> > >> >I think using direct runner as default with the option to specify
> > >> >other setup is a win-win. However, there are a few doubts I have about the
> > >> >Beam-based approach:
> > >> >
> > >> >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> > >> >will it install `apache-beam[gcp]`? What if there's a version clash
> > >> >between dependencies?
> > >> >
> > >> >2. The initial approach using `DataSource` concept allowed users to
> > >> >use it in any operator (not only transfer ones). In case of relying
> > >on
> > >> >Beam we are losing this.
> > >> >
> > >> >3. I'm not a Beam expert but it seems to not support any data
> > >lineage
> > >> >solution?
> > >> >
> > >> >
> > >> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> > >> ><da...@gmail.com> wrote:
> > >> >>
> > >> >> I think there are absolutely use-cases for both. I’m totally fine
> > >> >with saying “for small/medium use-cases, we come with an in-house
> > >> >system. However, for larger cases, you’ll require Spark/Flink/S3.”
> > >That’s
> > >> >totally in line with PLENTY of use-cases. This would be especially
> > >cool
> > >> >when matched with fast-follow as we could EVEN potentially tie in
> > >data
> > >> >locality.
> > >> >>
> > >> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> > >> ><wh...@gmail.com> wrote:
> > >> >> I believe - for not large data - the direct runner is wholly
> > >doable,
> > >> >which
> > >> >> seems in line with airflow patterns. I have, and have spoken with
> > >> >several
> > >> >> others that have, been productive with that runner.
> > >> >>
> > >> >> For much larger transfers, the generic operator could accept
> > >> >parameters for
> > >> >> submitting the compute to an actual runner. Though I imagine that
> > >> >> (needing a runner) would not be the primary use case for such an
> > >> >operator.
> > >> >>
> > >> >>
> > >> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek
> > ><tu...@apache.org>
> > >> >wrote:
> > >> >>
> > >> >> > Austin, you are right, Beam covers all (and more) important IOs.
> > >> >> > However, using Apache Beam to design a generic transfer operator
> > >> >> > requires Airflow users to have additional resources that will be
> > >> >used
> > >> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > >> >> > DirectRunner?
> > >> >> >
> > >> >> > Can you please tell us more how exactly you think we can use
> > >Beam
> > >> >for
> > >> >> > those Airflow transfer operators?
> > >> >> >
> > >> >> > Best,
> > >> >> > Tomek
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > >> >> > <wh...@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Are there IOs that would be desired for a generic transfer
> > >> >operator that
> > >> >> > > don't exist in:
> > >> >https://beam.apache.org/documentation/io/built-in/ <-
> > >> >> > > there is pretty solid coverage?
> > >> >> > >
> > >> >> > > Beam is getting to the point where even python beam can
> > >leverage
> > >> >the java
> > >> >> > > IOs, which increases the range of IOs (and performance).
> > >> >> > >
> > >> >> > >
> > >> >> > >
> > >> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> > >> ><Ja...@polidea.com>
> > >> >> > > wrote:
> > >> >> > >
> > >> >> > > > But I believe those two ideas are separate ones as Tomek
> > >> >explained :)
> > >> >> > > >
> > >> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> > >> ><Jarek.Potiuk@polidea.com
> > >> >> > >
> > >> >> > > > wrote:
> > >> >> > > >
> > >> >> > > > > I love the idea of connecting the projects more closely!
> > >> >> > > > >
> > >> >> > > > > I've been helping recently as a consultant in improving
> > >the
> > >> >Apache
> > >> >> > Beam
> > >> >> > > > > build infrastructure (in many parts based on my Airflow
> > >> >experience
> > >> >> > and
> > >> >> > > > > Github Actions - even recently they adopted the "cancel"
> > >> >action I
> > >> >> > > > developed
> > >> >> > > > > for Apache Airflow).
> > >> >https://github.com/apache/beam/pull/12729
> > >> >> > > > >
> > >> >> > > > > Synergies in Apache projects are cool.
> > >> >> > > > >
> > >> >> > > > > J.
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > >> >> > > > > <gc...@twitter.com.invalid> wrote:
> > >> >> > > > >
> > >> >> > > > >> Agree on keeping those separate, just intervened as I
> > >> >believe it's a
> > >> >> > > > great
> > >> >> > > > >> idea. But let's keep @beam and @spark to a separate
> > >thread.
> > >> >> > > > >>
> > >> >> > > > >>
> > >> >> > > > >> Gerard Casas Saez
> > >> >> > > > >> Twitter | Cortex | @casassaez
> > ><http://twitter.com/casassaez>
> > >> >> > > > >>
> > >> >> > > > >>
> > >> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> > >> >> > turbaszek@apache.org>
> > >> >> > > > >> wrote:
> > >> >> > > > >>
> > >> >> > > > >> > Daniel is right, we have a few Apache Beam committers in
> > >> >Polidea so
> > >> >> > we
> > >> >> > > > >> > will ask for advice. However, I would be highly in
> > >favor
> > >> >of
> > >> >> > having it
> > >> >> > > > >> > as Gerard suggested as @beam decorator. This is
> > >something
> > >> >we
> > >> >> > should
> > >> >> > > > >> > put into another AIP together with the mentioned @spark
> > >> >decorator.
> > >> >> > > > >> >
> > >> >> > > > >> > Our proposition of transfer operators was mainly to
> > >create
> > >> >> > something
> > >> >> > > > >> > Airflow-native that works out of the box and allows us
> > >to
> > >> >simplify
> > >> >> > > > >> > read/write from external sources. Thus, it requires no
> > >> >external
> > >> >> > > > >> > dependency other than the library to communicate with
> > >the
> > >> >API. In
> > >> >> > the
> > >> >> > > > >> > case of Beam we need more than that I think.
> > >> >> > > > >> >
> > >> >> > > > >> > Additionally, the ideas of Source and Destination play
> > >> >nicely with
> > >> >> > > > >> > data lineage and may bring more interest to this
> > >feature
> > >> >of
> > >> >> > Airflow.
> > >> >> > > > >> >
> > >> >> > > > >> > Cheers,
> > >> >> > > > >> > Tomek
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> > >> ><ka...@gmail.com>
> > >> >> > > > wrote:
> > >> >> > > > >> > >
> > >> >> > > > >> > > Nice. Just a note here, we will need to make sure
> > >that
> > >> >those
> > >> >> > > > "Source"
> > >> >> > > > >> and
> > >> >> > > > >> > > "Destination" are serializable.
> > >> >> > > > >> > >
> > >> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> > >> >> > > > daniel.imberman@gmail.com
> > >> >> > > > >> >
> > >> >> > > > >> > > wrote:
> > >> >> > > > >> > >
> > >> >> > > > >> > > > Interesting! Beam also could potentially allow
> > >> >transfers
> > >> >> > within
> > >> >> > > > >> > Dask/any
> > >> >> > > > >> > > > other system with a java/python SDK? I think @jarek
> > >> >and
> > >> >> > Polidea
> > >> >> > > > do a
> > >> >> > > > >> > lot of
> > >> >> > > > >> > > > work with Beam as well so I’d love their thoughts
> > >if
> > >> >this a
> > >> >> > good
> > >> >> > > > >> > use-case.
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > via Newton Mail [
> > >> >> > > > >> > > >
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > >
> > >> >> >
> > >>
> > >>
> > https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2
> > >> >> > > > >> > > > ]
> > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez
> > ><
> > >> >> > > > >> > gcasassaez@twitter.com.invalid>
> > >> >> > > > >> > > > wrote:
> > >> >> > > > >> > > > I would be highly in favour of having a generic
> > >Beam
> > >> >operator.
> > >> >> > > > >> Similar
> > >> >> > > > >> > > > to @spark_task decorator. Something where you can
> > >> >easily
> > >> >> > define
> > >> >> > > > and
> > >> >> > > > >> > wrap a
> > >> >> > > > >> > > > beam pipeline and convert it to an Airflow
> > >operator.
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > Gerard Casas Saez
> > >> >> > > > >> > > > Twitter | Cortex | @casassaez
> > >> ><http://twitter.com/casassaez>
> > >> >> > > > >> > > >
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> > >> >> > > > >> > > > whatwouldaustindo@gmail.com>
> > >> >> > > > >> > > > wrote:
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > > Are you guys familiar with Beam
> > >> ><https://beam.apache.org>?
> > >> >> > Esp.
> > >> >> > > > >> if
> > >> >> > > > >> > not
> > >> >> > > > >> > > > > doing transforms, it might rather straightforward
> > >to
> > >> >rely
> > >> >> > on the
> > >> >> > > > >> > > > ecosystem
> > >> >> > > > >> > > > > of connectors in that Apache Project to use as
> > >the
> > >> >> > foundations
> > >> >> > > > >> for a
> > >> >> > > > >> > > > > generic transfer operator.
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> > >> >> > > > >> > Jarek.Potiuk@polidea.com>
> > >> >> > > > >> > > > > wrote:
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > > > > +1
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski
> > ><
> > >> >> > > > >> > > > > > kamil.olszewski@polidea.com>
> > >> >> > > > >> > > > > > wrote:
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > > Hello all,
> > >> >> > > > >> > > > > > > since there have been no new comments shared
> > >in
> > >> >the POC
> > >> >> > doc
> > >> >> > > > >> > > > > > > <
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > >
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > >
> > >> >> >
> > >>
> > >>
> > https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > for a couple of days, then I will proceed
> > >with
> > >> >creating
> > >> >> > an
> > >> >> > > > AIP
> > >> >> > > > >> > for
> > >> >> > > > >> > > > this
> > >> >> > > > >> > > > > > > feature, if that is ok with everybody.
> > >> >> > > > >> > > > > > > Best regards,
> > >> >> > > > >> > > > > > > Kamil
> > >> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz
> > >Urbaszek
> > >> ><
> > >> >> > > > >> > > > turbaszek@apache.org
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > > wrote:
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > > I like the approach as it itnroduces
> > >another
> > >> >> > interesting
> > >> >> > > > >> > operators'
> > >> >> > > > >> > > > > > > > interface standarization. It would be
> > >awesome
> > >> >to here
> > >> >> > more
> > >> >> > > > >> > opinions
> > >> >> > > > >> > > > > :)
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > Cheers,
> > >> >> > > > >> > > > > > > > Tomek
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek
> > >Potiuk <
> > >> >> > > > >> > > > > Jarek.Potiuk@polidea.com
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > > wrote:
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > > I like the idea a lot. Similar things
> > >have
> > >> >been
> > >> >> > > > discussed
> > >> >> > > > >> > before
> > >> >> > > > >> > > > > but
> > >> >> > > > >> > > > > > > the
> > >> >> > > > >> > > > > > > > > proposal is I think rather pragmatic and
> > >> >solves a
> > >> >> > real
> > >> >> > > > >> > problem
> > >> >> > > > >> > > > (and
> > >> >> > > > >> > > > > > it
> > >> >> > > > >> > > > > > > > does
> > >> >> > > > >> > > > > > > > > not seem to be too complex to implement)
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > There is some discussion about it already
> > >in
> > >> >the
> > >> >> > > > document
> > >> >> > > > >> > (please
> > >> >> > > > >> > > > > > > > chime-in
> > >> >> > > > >> > > > > > > > > for those interested) but here a few
> > >points
> > >> >why I
> > >> >> > like
> > >> >> > > > it:
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - performance and optimization is not a
> > >> >focus for
> > >> >> > that.
> > >> >> > > > >> For
> > >> >> > > > >> > > > generic
> > >> >> > > > >> > > > > > > stuff
> > >> >> > > > >> > > > > > > > > it is usually to write "optimal" solution
> > >> >but once
> > >> >> > you
> > >> >> > > > >> admit
> > >> >> > > > >> > you
> > >> >> > > > >> > > > > are
> > >> >> > > > >> > > > > > > not
> > >> >> > > > >> > > > > > > > > going to focus for optimisation, you come
> > >> >with
> > >> >> > simpler
> > >> >> > > > and
> > >> >> > > > >> > easier
> > >> >> > > > >> > > > > to
> > >> >> > > > >> > > > > > > use
> > >> >> > > > >> > > > > > > > > solutions
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - on the other hand - it uses very
> > >> >"Python'y"
> > >> >> > approach
> > >> >> > > > >> with
> > >> >> > > > >> > using
> > >> >> > > > >> > > > > > > > > Airflow's familiar concepts (connection,
> > >> >transfer)
> > >> >> > and
> > >> >> > > > has
> > >> >> > > > >> > the
> > >> >> > > > >> > > > > > > potential
> > >> >> > > > >> > > > > > > > of
> > >> >> > > > >> > > > > > > > > plugging in into 100s of hooks we have
> > >> >already
> > >> >> > easily -
> > >> >> > > > >> > > > leveraging
> > >> >> > > > >> > > > > > all
> > >> >> > > > >> > > > > > > > the
> > >> >> > > > >> > > > > > > > > "providers" richness of Airflow.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it aims to be easy to do "quick start"
> > >-
> > >> >if you
> > >> >> > have a
> > >> >> > > > >> > number
> > >> >> > > > >> > > > of
> > >> >> > > > >> > > > > > > > > different sources/targets and as a data
> > >> >scientist
> > >> >> > you
> > >> >> > > > >> would
> > >> >> > > > >> > like
> > >> >> > > > >> > > > to
> > >> >> > > > >> > > > > > > > quickly
> > >> >> > > > >> > > > > > > > > start transferring data between them -
> > >you
> > >> >can do it
> > >> >> > > > >> easily
> > >> >> > > > >> > with
> > >> >> > > > >> > > > > > only
> > >> >> > > > >> > > > > > > > > basic python knowledge and simple DAG
> > >> >structure.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it should be possible to plug it in
> > >into
> > >> >our new
> > >> >> > > > >> functional
> > >> >> > > > >> > > > > > approach
> > >> >> > > > >> > > > > > > as
> > >> >> > > > >> > > > > > > > > well as future lineage discussions as it
> > >> >makes
> > >> >> > > > connection
> > >> >> > > > >> > between
> > >> >> > > > >> > > > > > > sources
> > >> >> > > > >> > > > > > > > > and targets
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it opens up possibilities of adding
> > >simple
> > >> >and
> > >> >> > > > flexible
> > >> >> > > > >> > data
> > >> >> > > > >> > > > > > > > > transformation on-transfer. Not a
> > >> >replacement for
> > >> >> > any of
> > >> >> > > > >> the
> > >> >> > > > >> > > > > external
> > >> >> > > > >> > > > > > > > > services that Airflow should use (Airflow
> > >is
> > >> >an
> > >> >> > > > >> > orchestrator, not
> > >> >> > > > >> > > > > > data
> > >> >> > > > >> > > > > > > > > processing solution) but for the kind of
> > >> >quick-start
> > >> >> > > > >> > scenarios I
> > >> >> > > > >> > > > > > > foresee
> > >> >> > > > >> > > > > > > > it
> > >> >> > > > >> > > > > > > > > might be most useful, being able to apply
> > >> >simple
> > >> >> > data
> > >> >> > > > >> > > > > transformation
> > >> >> > > > >> > > > > > on
> > >> >> > > > >> > > > > > > > the
> > >> >> > > > >> > > > > > > > > fly by data scientist might be a big
> > >plus.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Suggestion: Panda DataFrame as the format
> > >of
> > >> >the
> > >> >> > "data"
> > >> >> > > > >> > component
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Kamil - you should have access now.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > J.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
> > >> >Olszewski <
> > >> >> > > > >> > > > > > > > > kamil.olszewski@polidea.com>
> > >> >> > > > >> > > > > > > > > wrote:
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Hello all,
> > >> >> > > > >> > > > > > > > > > in Polidea we have come up with an idea
> > >> >for a
> > >> >> > generic
> > >> >> > > > >> > transfer
> > >> >> > > > >> > > > > > > operator
> > >> >> > > > >> > > > > > > > > > that would be able to transport data
> > >> >between two
> > >> >> > > > >> > destinations
> > >> >> > > > >> > > > of
> > >> >> > > > >> > > > > > > > various
> > >> >> > > > >> > > > > > > > > > types (file, database, storage, etc.) -
> > >> >please
> > >> >> > find
> > >> >> > > > the
> > >> >> > > > >> > link
> > >> >> > > > >> > > > > with a
> > >> >> > > > >> > > > > > > > short
> > >> >> > > > >> > > > > > > > > > doc with POC
> > >> >> > > > >> > > > > > > > > > <
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > >
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > >
> > >> >> >
> > >>
> > >>
> > https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
> > >> >> > > > >> > > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > where we can discuss the design
> > >initially.
> > >> >Once we
> > >> >> > > > come
> > >> >> > > > >> to
> > >> >> > > > >> > the
> > >> >> > > > >> > > > > > > initial
> > >> >> > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki
> > >-
> > >> >can I
> > >> >> > ask
> > >> >> > > > for
> > >> >> > > > >> > > > > permission
> > >> >> > > > >> > > > > > to
> > >> >> > > > >> > > > > > > > do
> > >> >> > > > >> > > > > > > > > so
> > >> >> > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I believe
> > >> >that
> > >> >> > during
> > >> >> > > > the
> > >> >> > > > >> > > > > discussion
> > >> >> > > > >> > > > > > we
> > >> >> > > > >> > > > > > > > > > should definitely aim for this feature
> > >to
> > >> >be
> > >> >> > released
> > >> >> > > > >> only
> > >> >> > > > >> > > > after
> > >> >> > > > >> > > > > > > > Airflow
> > >> >> > > > >> > > > > > > > > > 2.0 is out.
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > What do you think about this idea?
> > >Would
> > >> >you find
> > >> >> > such
> > >> >> > > > >> an
> > >> >> > > > >> > > > > operator
> > >> >> > > > >> > > > > > > > > helpful
> > >> >> > > > >> > > > > > > > > > in your pipelines? Maybe you already
> > >use a
> > >> >similar
> > >> >> > > > >> > solution or
> > >> >> > > > >> > > > > know
> > >> >> > > > >> > > > > > > > > > packages that could be used to
> > >implement
> > >> >it?
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Best regards,
> > >> >> > > > >> > > > > > > > > > --
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Kamil Olszewski
> > >> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
> > >> >Software
> > >> >> > Engineer
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > M: +48 503 361 783
> > >> >> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Unique Tech
> > >> >> > > > >> > > > > > > > > > Check out our projects! <
> > >> >> > > > >> https://www.polidea.com/our-work>
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > --
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Jarek Potiuk
> > >> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
> > >> >Principal
> > >> >> > Software
> > >> >> > > > >> > Engineer
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > >> > > > > > > > > [image: Polidea]
> > ><https://www.polidea.com/>
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > --
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > Kamil Olszewski
> > >> >> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software
> > >> >Engineer
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > M: +48 503 361 783
> > >> >> > > > >> > > > > > > E: kamil.olszewski@polidea.com
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > Unique Tech
> > >> >> > > > >> > > > > > > Check out our projects! <
> > >> >> > https://www.polidea.com/our-work>
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > --
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > Jarek Potiuk
> > >> >> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal
> > >> >Software
> > >> >> > > > >> Engineer
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > >
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> > --
> > >> >> > > > >> >
> > >> >> > > > >> > Tomasz Urbaszek
> > >> >> > > > >> > Polidea | Software Engineer
> > >> >> > > > >> >
> > >> >> > > > >> > M: +48 505 628 493
> > >> >> > > > >> > E: tomasz.urbaszek@polidea.com
> > >> >> > > > >> >
> > >> >> > > > >> > Unique Tech
> > >> >> > > > >> > Check out our projects!
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > --
> > >> >> > > > >
> > >> >> > > > > Jarek Potiuk
> > >> >> > > > > Polidea <https://www.polidea.com/> | Principal Software
> > >> >Engineer
> > >> >> > > > >
> > >> >> > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > >
> > >> >> > > > Jarek Potiuk
> > >> >> > > > Polidea <https://www.polidea.com/> | Principal Software
> > >> >Engineer
> > >> >> > > >
> > >> >> > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > >
> > >> >> >
> >



--

Tomasz Urbaszek
Polidea | Software Engineer

M: +48 505 628 493
E: tomasz.urbaszek@polidea.com

Unique Tech
Check out our projects!

Re: Generic Transfer Operator

Posted by Tomasz Urbaszek <tu...@apache.org>.
I checked with our Beam team: DirectRunner is supported by the Python
SDK and requires no JVM. That's the main reason I think it's worth
considering :) A hard dependency on the JVM would probably be a no-go
for us.
https://beam.apache.org/documentation/runners/direct/

Tomek
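
For readers following the thread, here is a minimal sketch of the
"Airflow-native" side of the comparison: a Source/Destination pair moved
by a plain in-process function, with no external runner. This is not the
POC's actual API. The names LocalFileSource, LocalFileDestination,
transfer and describe are hypothetical, chosen only for illustration, and
local text files stand in for real hooks. The endpoints are plain
dataclasses so they can be serialized, which is the property Kaxil asks
for elsewhere in the thread.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class LocalFileSource:
    """Hypothetical source: yields the lines of a local text file."""

    path: str

    def read(self) -> Iterator[str]:
        with open(self.path, "r", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


@dataclass
class LocalFileDestination:
    """Hypothetical destination: writes records to a local text file."""

    path: str

    def write(self, records: Iterable[str]) -> int:
        count = 0
        with open(self.path, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(record + "\n")
                count += 1
        return count


def transfer(source, destination,
             transform: Optional[Callable[[str], str]] = None) -> int:
    """Stream records from source to destination, one at a time.

    An optional per-record transform is applied on the fly.
    Returns the number of records written.
    """
    records = source.read()
    if transform is not None:
        records = (transform(record) for record in records)
    return destination.write(records)


def describe(endpoint) -> str:
    """Serialize an endpoint to JSON using only its dataclass fields."""
    return json.dumps({"type": type(endpoint).__name__, **asdict(endpoint)})
```

Because read() is a generator, records are streamed rather than loaded
into memory at once, so file size alone does not force the use of an
external runner in this sketch.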


On Sun, Sep 6, 2020 at 9:45 PM Daniel Imberman
<da...@gmail.com> wrote:
>
> Oof ok yeah. I hadn't realized that Beam had a hard JVM requirement. I
> think that initially offering a local or block storage based solution with
> easy extensions for users is totally in line with Airflow philosophy. I
> think that offering alternative transfer operators in providers is a great
> idea!
>
> On Sun, Sep 6, 2020, 9:07 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> > No strong opinion - but it seems like generic is the easiest for us to
> > code (as we have most of it already via hooks?) and adopt (and it doesn't
> > place a hard requirement on Beam/JVM, even if the JVM would only be needed
> > at runtime. Still.)
> >
> > This is possibly where Airflow has a core TransferOperator, and
> > providers.apache.beam.operators.BeamTransferOperator? If the "same" Python
> > API could be used for both, and it doesn't needlessly complicate things.
> >
> > -a
> >
> > On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <tu...@apache.org>
> > wrote:
> > >Thanks, Ash, for pointing to https://pypi.org/project/smart-open/ This
> > >one looks really interesting for blob storage transfers!
> > >
> > >As stated in the initial design doc I don't think we should focus on
> > >best performance but rather on versatility. Currently, we have many
> > >AtoB operators that do not yield the highest performance but do their
> > >work and are widely used.
> > >
> > >I would say that we should prepare an AIP that will propose two
> > >approaches: generic vs beam. This will allow us to compare them and
> > >then we can vote which one is better from the Airflow community
> > >perspective.
> > >
> > >What do you think?
> > >
> > >Tomek
> > >
> > >
> > >On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org>
> > >wrote:
> > >>
> > >> For background: in the past I had an S3 to S3 transfer using smart-open
> > >> (since we wanted to split one giant ~300GB file into smaller parts) and
> > >> it took about 10 mins, so even "large" uses can work fine in Airflow -
> > >> no JVM required.
> > >>
> > >> -ash
> > >>
> > >> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <tu...@apache.org> wrote:
> > >> >I think using the direct runner as the default, with the option to
> > >> >specify another setup, is a win-win. However, I have a few doubts
> > >> >about the Beam-based approach:
> > >> >
> > >> >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> > >> >will it install `apache-beam[gcp]`? What if there's a version clash
> > >> >between dependencies?
> > >> >
> > >> >2. The initial approach using the `DataSource` concept allowed users to
> > >> >use it in any operator (not only transfer ones). If we rely on Beam,
> > >> >we lose this.
> > >> >
> > >> >3. I'm not a Beam expert but it seems to not support any data lineage
> > >> >solution?
> > >> >
> > >> >
> > >> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> > >> ><da...@gmail.com> wrote:
> > >> >>
> > >> >> I think there are absolutely use-cases for both. I’m totally fine
> > >> >> with saying “for small/medium use-cases, we come with an in-house
> > >> >> system. However, for larger cases, you’ll require Spark/Flink/S3.”
> > >> >> That’s totally in line with PLENTY of use-cases. This would be
> > >> >> especially cool when matched with fast-follow as we could EVEN
> > >> >> potentially tie in data locality.
> > >> >>
> > >> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <wh...@gmail.com> wrote:
> > >> >> I believe - for not large data - the direct runner is wholly doable,
> > >> >> which seems in line with Airflow patterns. I have, and have spoken with
> > >> >> several others that have, been productive with that runner.
> > >> >>
> > >> >> For much larger transfers, the generic operator could accept parameters
> > >> >> for submitting the compute to an actual runner. Though I imagine that
> > >> >> (needing a runner) would not be the primary use case for such an operator.
> > >> >>
> > >> >>
> > >> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <tu...@apache.org> wrote:
> > >> >>
> > >> >> > Austin, you are right, Beam covers all (and more) important IOs.
> > >> >> > However, using Apache Beam to design a generic transfer operator
> > >> >> > requires Airflow users to have additional resources that will be used
> > >> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > >> >> > DirectRunner?
> > >> >> >
> > >> >> > Can you please tell us more about how exactly you think we can use Beam
> > >> >> > for those Airflow transfer operators?
> > >> >> >
> > >> >> > Best,
> > >> >> > Tomek
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > >> >> > <wh...@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Are there IOs that would be desired for a generic transfer operator that
> > >> >> > > don't exist in: https://beam.apache.org/documentation/io/built-in/ <-
> > >> >> > > there is pretty solid coverage?
> > >> >> > >
> > >> >> > > Beam is getting to the point where even Python Beam can leverage the Java
> > >> >> > > IOs, which increases the range of IOs (and performance).
> > >> >> > >
> > >> >> > >
> > >> >> > >
> > >> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <Ja...@polidea.com>
> > >> >> > > wrote:
> > >> >> > >
> > >> >> > > > But I believe those two ideas are separate ones as Tomek explained :)
> > >> >> > > >
> > >> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > >> >> > > > wrote:
> > >> >> > > >
> > >> >> > > > > I love the idea of connecting the projects more closely!
> > >> >> > > > >
> > >> >> > > > > I've been helping recently as a consultant in improving the Apache Beam
> > >> >> > > > > build infrastructure (in many parts based on my Airflow experience and
> > >> >> > > > > GitHub Actions - even recently they adopted the "cancel" action I
> > >> >> > > > > developed for Apache Airflow). https://github.com/apache/beam/pull/12729
> > >> >> > > > >
> > >> >> > > > > Synergies in Apache projects are cool.
> > >> >> > > > >
> > >> >> > > > > J.
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > >> >> > > > > <gc...@twitter.com.invalid> wrote:
> > >> >> > > > >
> > >> >> > > > >> Agree on keeping those separate, just intervened as I believe it's a
> > >> >> > > > >> great idea. But let's keep @beam and @spark to a separate thread.
> > >> >> > > > >>
> > >> >> > > > >>
> > >> >> > > > >> Gerard Casas Saez
> > >> >> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > >> >> > > > >>
> > >> >> > > > >>
> > >> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <turbaszek@apache.org>
> > >> >> > > > >> wrote:
> > >> >> > > > >>
> > >> >> > > > >> > Daniel is right, we have a few Apache Beam committers in Polidea so we
> > >> >> > > > >> > will ask for advice. However, I would be highly in favor of having it
> > >> >> > > > >> > as Gerard suggested as @beam decorator. This is something we should
> > >> >> > > > >> > put into another AIP together with the mentioned @spark decorator.
> > >> >> > > > >> >
> > >> >> > > > >> > Our proposition of transfer operators was mainly to create something
> > >> >> > > > >> > Airflow-native that works out of the box and allows us to simplify
> > >> >> > > > >> > read/write from external sources. Thus, it requires no external
> > >> >> > > > >> > dependency other than the library to communicate with the API. In the
> > >> >> > > > >> > case of Beam we need more than that I think.
> > >> >> > > > >> >
> > >> >> > > > >> > Additionally, the ideas of Source and Destination play nicely with
> > >> >> > > > >> > data lineage and may bring more interest to this feature of Airflow.
> > >> >> > > > >> >
> > >> >> > > > >> > Cheers,
> > >> >> > > > >> > Tomek
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <ka...@gmail.com> wrote:
> > >> >> > > > >> > >
> > >> >> > > > >> > > Nice. Just a note here, we will need to make sure that those "Source"
> > >> >> > > > >> > > and "Destination" are serializable.
> > >> >> > > > >> > >
> > >> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <daniel.imberman@gmail.com>
> > >> >> > > > >> > > wrote:
> > >> >> > > > >> > >
> > >> >> > > > >> > > > Interesting! Beam also could potentially allow transfers within Dask/any
> > >> >> > > > >> > > > other system with a Java/Python SDK? I think @jarek and Polidea do a lot
> > >> >> > > > >> > > > of work with Beam as well so I’d love their thoughts if this is a good
> > >> >> > > > >> > > > use-case.
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez
> > >> >> > > > >> > > > <gcasassaez@twitter.com.invalid> wrote:
> > >> >> > > > >> > > > I would be highly in favour of having a generic Beam operator. Similar
> > >> >> > > > >> > > > to @spark_task decorator. Something where you can easily define and wrap
> > >> >> > > > >> > > > a beam pipeline and convert it to an Airflow operator.
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > Gerard Casas Saez
> > >> >> > > > >> > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > >> >> > > > >> > > >
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett
> > >> >> > > > >> > > > <whatwouldaustindo@gmail.com> wrote:
> > >> >> > > > >> > > >
> > >> >> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not
> > >> >> > > > >> > > > > doing transforms, it might be rather straightforward to rely on the
> > >> >> > > > >> > > > > ecosystem of connectors in that Apache Project to use as the foundations
> > >> >> > > > >> > > > > for a generic transfer operator.
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > >> >> > > > >> > > > > wrote:
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > > > > +1
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski
> > >> >> > > > >> > > > > > <kamil.olszewski@polidea.com> wrote:
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > > Hello all,
> > >> >> > > > >> > > > > > > since there have been no new comments shared in the POC doc
> > >> >> > > > >> > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> > >> >> > > > >> > > > > > > for a couple of days, then I will proceed with creating an AIP for this
> > >> >> > > > >> > > > > > > feature, if that is ok with everybody.
> > >> >> > > > >> > > > > > > Best regards,
> > >> >> > > > >> > > > > > > Kamil
> > >> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <turbaszek@apache.org>
> > >> >> > > > >> > > > > > > wrote:
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > > I like the approach as it introduces another interesting operators'
> > >> >> > > > >> > > > > > > > interface standardization. It would be awesome to hear more opinions :)
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > Cheers,
> > >> >> > > > >> > > > > > > > Tomek
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > >> >> > > > >> > > > > > > > wrote:
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > > > > I like the idea a lot. Similar things have been discussed before but
> > >> >> > > > >> > > > > > > > > the proposal is, I think, rather pragmatic and solves a real problem
> > >> >> > > > >> > > > > > > > > (and it does not seem to be too complex to implement).
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > There is some discussion about it already in the document (please
> > >> >> > > > >> > > > > > > > > chime in if you are interested) but here are a few points why I like it:
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - performance and optimization are not a focus for that. For generic
> > >> >> > > > >> > > > > > > > > stuff it is usually hard to write an "optimal" solution, but once you
> > >> >> > > > >> > > > > > > > > admit you are not going to focus on optimisation, you come up with
> > >> >> > > > >> > > > > > > > > simpler and easier to use solutions
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - on the other hand - it uses a very "Python'y" approach with
> > >> >> > > > >> > > > > > > > > Airflow's familiar concepts (connection, transfer) and has the
> > >> >> > > > >> > > > > > > > > potential of easily plugging into the 100s of hooks we already have -
> > >> >> > > > >> > > > > > > > > leveraging all the "providers" richness of Airflow.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it aims to be easy to do "quick start" - if you have a number of
> > >> >> > > > >> > > > > > > > > different sources/targets and as a data scientist you would like to
> > >> >> > > > >> > > > > > > > > quickly start transferring data between them - you can do it easily
> > >> >> > > > >> > > > > > > > > with only basic Python knowledge and a simple DAG structure.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it should be possible to plug it into our new functional approach
> > >> >> > > > >> > > > > > > > > as well as future lineage discussions, as it makes a connection
> > >> >> > > > >> > > > > > > > > between sources and targets
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > - it opens up possibilities of adding simple and flexible data
> > >> >> > > > >> > > > > > > > > transformation on-transfer. Not a replacement for any of the external
> > >> >> > > > >> > > > > > > > > services that Airflow should use (Airflow is an orchestrator, not a
> > >> >> > > > >> > > > > > > > > data processing solution) but for the kind of quick-start scenarios I
> > >> >> > > > >> > > > > > > > > foresee it might be most useful, being able to apply simple data
> > >> >> > > > >> > > > > > > > > transformation on the fly by a data scientist might be a big plus.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Suggestion: Panda DataFrame as the format
> > >of
> > >> >the
> > >> >> > "data"
> > >> >> > > > >> > component
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Kamil - you should have access now.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > J.
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
> > >> >Olszewski <
> > >> >> > > > >> > > > > > > > > kamil.olszewski@polidea.com>
> > >> >> > > > >> > > > > > > > > wrote:
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Hello all,
> > >> >> > > > >> > > > > > > > > > in Polidea we have come up with an idea
> > >> >for a
> > >> >> > generic
> > >> >> > > > >> > transfer
> > >> >> > > > >> > > > > > > operator
> > >> >> > > > >> > > > > > > > > > that would be able to transport data
> > >> >between two
> > >> >> > > > >> > destinations
> > >> >> > > > >> > > > of
> > >> >> > > > >> > > > > > > > various
> > >> >> > > > >> > > > > > > > > > types (file, database, storage, etc.) -
> > >> >please
> > >> >> > find
> > >> >> > > > the
> > >> >> > > > >> > link
> > >> >> > > > >> > > > > with a
> > >> >> > > > >> > > > > > > > short
> > >> >> > > > >> > > > > > > > > > doc with POC
> > >> >> > > > >> > > > > > > > > > <
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > >
> > >> >> > > > >> > > >
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > >
> > >> >> >
> > >>
> > >>
> > https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
> > >> >> > > > >> > > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > where we can discuss the design
> > >initially.
> > >> >Once we
> > >> >> > > > come
> > >> >> > > > >> to
> > >> >> > > > >> > the
> > >> >> > > > >> > > > > > > initial
> > >> >> > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki
> > >-
> > >> >can I
> > >> >> > ask
> > >> >> > > > for
> > >> >> > > > >> > > > > permission
> > >> >> > > > >> > > > > > to
> > >> >> > > > >> > > > > > > > do
> > >> >> > > > >> > > > > > > > > so
> > >> >> > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I believe
> > >> >that
> > >> >> > during
> > >> >> > > > the
> > >> >> > > > >> > > > > discussion
> > >> >> > > > >> > > > > > we
> > >> >> > > > >> > > > > > > > > > should definitely aim for this feature
> > >to
> > >> >be
> > >> >> > released
> > >> >> > > > >> only
> > >> >> > > > >> > > > after
> > >> >> > > > >> > > > > > > > Airflow
> > >> >> > > > >> > > > > > > > > > 2.0 is out.
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > What do you think about this idea?
> > >Would
> > >> >you find
> > >> >> > such
> > >> >> > > > >> an
> > >> >> > > > >> > > > > operator
> > >> >> > > > >> > > > > > > > > helpful
> > >> >> > > > >> > > > > > > > > > in your pipelines? Maybe you already
> > >use a
> > >> >similar
> > >> >> > > > >> > solution or
> > >> >> > > > >> > > > > know
> > >> >> > > > >> > > > > > > > > > packages that could be used to
> > >implement
> > >> >it?
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Best regards,
> > >> >> > > > >> > > > > > > > > > --
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Kamil Olszewski
> > >> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
> > >> >Software
> > >> >> > Engineer
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > M: +48 503 361 783
> > >> >> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > > > Unique Tech
> > >> >> > > > >> > > > > > > > > > Check out our projects! <
> > >> >> > > > >> https://www.polidea.com/our-work>
> > >> >> > > > >> > > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > --
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > Jarek Potiuk
> > >> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
> > >> >Principal
> > >> >> > Software
> > >> >> > > > >> > Engineer
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > >> > > > > > > > > [image: Polidea]
> > ><https://www.polidea.com/>
> > >> >> > > > >> > > > > > > > >
> > >> >> > > > >> > > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > --
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > Kamil Olszewski
> > >> >> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software
> > >> >Engineer
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > M: +48 503 361 783
> > >> >> > > > >> > > > > > > E: kamil.olszewski@polidea.com
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > > > Unique Tech
> > >> >> > > > >> > > > > > > Check out our projects! <
> > >> >> > https://www.polidea.com/our-work>
> > >> >> > > > >> > > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > --
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > Jarek Potiuk
> > >> >> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal
> > >> >Software
> > >> >> > > > >> Engineer
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > > >> > > > > >
> > >> >> > > > >> > > > >
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> >
> > >> >> > > > >> > --
> > >> >> > > > >> >
> > >> >> > > > >> > Tomasz Urbaszek
> > >> >> > > > >> > Polidea | Software Engineer
> > >> >> > > > >> >
> > >> >> > > > >> > M: +48 505 628 493
> > >> >> > > > >> > E: tomasz.urbaszek@polidea.com
> > >> >> > > > >> >
> > >> >> > > > >> > Unique Tech
> > >> >> > > > >> > Check out our projects!
> > >> >> > > > >> >
> > >> >> > > > >>
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > --
> > >> >> > > > >
> > >> >> > > > > Jarek Potiuk
> > >> >> > > > > Polidea <https://www.polidea.com/> | Principal Software
> > >> >Engineer
> > >> >> > > > >
> > >> >> > > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > >
> > >> >> > > > Jarek Potiuk
> > >> >> > > > Polidea <https://www.polidea.com/> | Principal Software
> > >> >Engineer
> > >> >> > > >
> > >> >> > > > M: +48 660 796 129 <+48660796129>
> > >> >> > > > [image: Polidea] <https://www.polidea.com/>
> > >> >> > > >
> > >> >> >
> >



-- 

Tomasz Urbaszek
Polidea | Software Engineer

M: +48 505 628 493
E: tomasz.urbaszek@polidea.com

Unique Tech
Check out our projects!

Re: Generic Transfer Operator

Posted by Daniel Imberman <da...@gmail.com>.
Oof ok yeah. I hadn't realized that beam had a hard JVM requirement. I
think that initially offering a local or block storage based solution with
easy extensions for users is totally in line with airflow philosophy. I
think that offering alternative transfer operators in providers is a great
idea!

On Sun, Sep 6, 2020, 9:07 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> No strong opinion - but it seems like generic is the easiest for us to
> code (as we have most of it already via hooks?) and adopt (and doesn't
> place a hard requirement on Beam/JVM, even if JVM would only be runtime.
> Still)
>
> This is possibly where Airflow has a core TransferOperator, and
> providers.apache.beam.operators.BeamTransferOperator? If the "same" python
> API could be used for both, and it doesn't needlessly complicate things.
>
> -a
>
> On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <tu...@apache.org>
> wrote:
> >Thanks, Ash for pointing to https://pypi.org/project/smart-open/ This
> >one looks really interesting for blob storages transfer!
> >
> >As stated in the initial design doc I don't think we should focus on
> >best performance but rather on versatility. Currently, we have many
> >AtoB operators that do not yield the highest performance but do their
> >work and are widely used.
> >
> >I would say that we should prepare an AIP that will propose two
> >approaches: generic vs beam. This will allow us to compare them and
> >then we can vote which one is better from the Airflow community
> >perspective.
> >
> >What do you think?
> >
> >Tomek
> >
> >
> >On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org>
> >wrote:
> >>
> >> For background: in the past I had an S3 to S3 transfer using
> >> smart_open (since we wanted to split one giant ~300GB file into smaller
> >> parts) and it took about 10mins, so even "large" uses can work fine in
> >> Airflow - no JVM required.
> >>
> >> -ash
> >>
> >> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek
> ><tu...@apache.org> wrote:
> >> >I think using direct runner as default with the option to specify
> >> >other setup is a win-win. However, there are a few doubts I have about
> >> >the Beam-based approach:
> >> >
> >> >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> >> >will it install `apache-beam[gcp]`? What if there's a version clash
> >> >between dependencies?
> >> >
> >> >2. The initial approach using `DataSource` concept allowed users to
> >> >use it in any operator (not only transfer ones). In case of relying on
> >> >Beam we are losing this.
> >> >
> >> >3. I'm not a Beam expert but it seems to not support any data lineage
> >> >solution?
> >> >
> >> >
> >> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> >> ><da...@gmail.com> wrote:
> >> >>
> >> >> I think there are absolutely use-cases for both. I’m totally fine
> >> >> with saying “for small/medium use-cases, we come with an in-house
> >> >> system. However for larger cases, you’ll require spark/Flink/S3.”
> >> >> That’s totally in line with PLENTY of use-cases. This would be
> >> >> especially cool when matched with fast-follow as we could EVEN
> >> >> potentially tie in data locality.
> >> >>
> >> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> >> ><wh...@gmail.com> wrote:
> >> >> I believe - for not large data - the direct runner is wholly doable,
> >> >> which seems in line with airflow patterns. I have, and have spoken with
> >> >> several others that have, been productive with that runner.
> >> >>
> >> >> For much larger transfers, the generic operator could accept parameters
> >> >> for submitting the compute to an actual runner. Though, imagining that
> >> >> (needing a runner) would not be the primary use case for such an
> >> >> operator.
> >> >>
> >> >>
> >> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek
> ><tu...@apache.org>
> >> >wrote:
> >> >>
> >> >> > Austin, you are right, Beam covers all (and more) important IOs.
> >> >> > However, using Apache Beam to design a generic transfer operator
> >> >> > requires Airflow users to have additional resources that will be used
> >> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> >> >> > DirectRunner?
> >> >> >
> >> >> > Can you please tell us more how exactly you think we can use
> >> >> > Beam for those Airflow transfer operators?
> >> >> >
> >> >> > Best,
> >> >> > Tomek
> >> >> >
> >> >> >
> >> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> >> >> > <wh...@gmail.com> wrote:
> >> >> > >
> >> >> > > Are there IOs that would be desired for a generic transfer operator
> >> >> > > that don't exist in:
> >> >> > > https://beam.apache.org/documentation/io/built-in/ <- there is
> >> >> > > pretty solid coverage?
> >> >> > >
> >> >> > > Beam is getting to the point where even python beam can leverage
> >> >> > > the java IOs, which increases the range of IOs (and performance).
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> >> ><Ja...@polidea.com>
> >> >> > > wrote:
> >> >> > >
> >> >> > > > But I believe those two ideas are separate ones as Tomek explained :)
> >> >> > > >
> >> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> >> ><Jarek.Potiuk@polidea.com
> >> >> > >
> >> >> > > > wrote:
> >> >> > > >
> >> >> > > > > I love the idea of connecting the projects more closely!
> >> >> > > > >
> >> >> > > > > I've been helping recently as a consultant in improving the Apache
> >> >> > > > > Beam build infrastructure (in many parts based on my Airflow
> >> >> > > > > experience and Github Actions - even recently they adopted the
> >> >> > > > > "cancel" action I developed for Apache Airflow).
> >> >> > > > > https://github.com/apache/beam/pull/12729
> >> >> > > > >
> >> >> > > > > Synergies in Apache projects are cool.
> >> >> > > > >
> >> >> > > > > J.
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> >> >> > > > > <gc...@twitter.com.invalid> wrote:
> >> >> > > > >
> >> >> > > > >> Agree on keeping those separate, just intervened as I believe it's
> >> >> > > > >> a great idea. But let's keep @beam and @spark to a separate thread.
> >> >> > > > >>
> >> >> > > > >>
> >> >> > > > >> Gerard Casas Saez
> >> >> > > > >> Twitter | Cortex | @casassaez
> ><http://twitter.com/casassaez>
> >> >> > > > >>
> >> >> > > > >>
> >> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> >> >> > turbaszek@apache.org>
> >> >> > > > >> wrote:
> >> >> > > > >>
> >> >> > > > >> > Daniel is right, we have a few Apache Beam committers in Polidea so
> >> >> > > > >> > we will ask for advice. However, I would be highly in favor of
> >> >> > > > >> > having it, as Gerard suggested, as a @beam decorator. This is
> >> >> > > > >> > something we should put into another AIP together with the
> >> >> > > > >> > mentioned @spark decorator.
> >> >> > > > >> >
> >> >> > > > >> > Our proposition of transfer operators was mainly to create something
> >> >> > > > >> > Airflow-native that works out of the box and allows us to simplify
> >> >> > > > >> > read/write from external sources. Thus, it requires no external
> >> >> > > > >> > dependency other than the library to communicate with the API. In
> >> >> > > > >> > the case of Beam we need more than that I think.
> >> >> > > > >> >
> >> >> > > > >> > Additionally, the ideas of Source and Destination play nicely with
> >> >> > > > >> > data lineage and may bring more interest to this feature of Airflow.
> >> >> > > > >> >
> >> >> > > > >> > Cheers,
> >> >> > > > >> > Tomek
> >> >> > > > >> >
> >> >> > > > >> >
> >> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> >> ><ka...@gmail.com>
> >> >> > > > wrote:
> >> >> > > > >> > >
> >> >> > > > >> > > Nice. Just a note here, we will need to make sure that those
> >> >> > > > >> > > "Source" and "Destination" are serializable.
> >> >> > > > >> > >
> >> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> >> >> > > > daniel.imberman@gmail.com
> >> >> > > > >> >
> >> >> > > > >> > > wrote:
> >> >> > > > >> > >
> >> >> > > > >> > > > Interesting! Beam also could potentially allow transfers within
> >> >> > > > >> > > > Dask/any other system with a java/python SDK? I think @jarek and
> >> >> > > > >> > > > Polidea do a lot of work with Beam as well so I’d love their
> >> >> > > > >> > > > thoughts if this is a good use-case.
> >> >> > > > >> > > >
> >> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez
> ><
> >> >> > > > >> > gcasassaez@twitter.com.invalid>
> >> >> > > > >> > > > wrote:
> >> >> > > > >> > > > I would be highly in favour of having a generic Beam operator.
> >> >> > > > >> > > > Similar to @spark_task decorator. Something where you can easily
> >> >> > > > >> > > > define and wrap a beam pipeline and convert it to an Airflow
> >> >> > > > >> > > > operator.
> >> >> > > > >> > > >
> >> >> > > > >> > > > Gerard Casas Saez
> >> >> > > > >> > > > Twitter | Cortex | @casassaez
> >> ><http://twitter.com/casassaez>
> >> >> > > > >> > > >
> >> >> > > > >> > > >
> >> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> >> >> > > > >> > > > whatwouldaustindo@gmail.com>
> >> >> > > > >> > > > wrote:
> >> >> > > > >> > > >
> >> >> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if
> >> >> > > > >> > > > > not doing transforms, it might be rather straightforward to rely
> >> >> > > > >> > > > > on the ecosystem of connectors in that Apache Project to use as
> >> >> > > > >> > > > > the foundations for a generic transfer operator.
> >> >> > > > >> > > > >
> >> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> >> >> > > > >> > Jarek.Potiuk@polidea.com>
> >> >> > > > >> > > > > wrote:
> >> >> > > > >> > > > >
> >> >> > > > >> > > > > > +1
> >> >> > > > >> > > > > >
>

Re: Generic Transfer Operator

Posted by Ash Berlin-Taylor <as...@apache.org>.
No strong opinion, but it seems like generic is the easiest for us to code (as we have most of it already via hooks?) and to adopt (it doesn't place a hard requirement on Beam/JVM, even if the JVM would only be needed at runtime).

This is possibly where Airflow has a core TransferOperator and a providers.apache.beam.operators.BeamTransferOperator? That would work if the "same" Python API could be used for both and it doesn't needlessly complicate things.
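
One way to picture the "same Python API, two implementations" idea is a small strategy interface. The sketch below is only an illustration in plain Python: TransferBackend, InProcessBackend, BatchingBackend and TransferTask are hypothetical names, none of them exist in Airflow or Beam, and the "pipeline-style" backend merely batches records in memory instead of calling any real runner.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class TransferBackend(ABC):
    """Hypothetical strategy interface: move records from a source into a sink."""

    @abstractmethod
    def transfer(self, records: Iterable[dict], sink: List[dict]) -> int:
        """Copy all records into sink and return how many were moved."""


class InProcessBackend(TransferBackend):
    """Stand-in for the hook-based path: copy records one by one in the worker."""

    def transfer(self, records: Iterable[dict], sink: List[dict]) -> int:
        count = 0
        for record in records:
            sink.append(record)
            count += 1
        return count


class BatchingBackend(TransferBackend):
    """Stand-in for a pipeline-style backend; it only batches in memory here."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size

    def transfer(self, records: Iterable[dict], sink: List[dict]) -> int:
        count = 0
        batch: List[dict] = []
        for record in records:
            batch.append(record)
            if len(batch) == self.batch_size:
                sink.extend(batch)
                count += len(batch)
                batch = []
        # Flush whatever is left in the last, possibly partial, batch.
        sink.extend(batch)
        count += len(batch)
        return count


class TransferTask:
    """Hypothetical operator-like wrapper; it is not an Airflow BaseOperator."""

    def __init__(self, source: Iterable[dict],
                 backend: Optional[TransferBackend] = None) -> None:
        self.source = source
        # The simple in-process backend is the default; others are opt-in.
        self.backend = backend or InProcessBackend()

    def execute(self, sink: List[dict]) -> int:
        return self.backend.transfer(self.source, sink)


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
default_sink: List[dict] = []
batched_sink: List[dict] = []
moved_default = TransferTask(rows).execute(default_sink)
moved_batched = TransferTask(rows, backend=BatchingBackend(batch_size=2)).execute(batched_sink)
```

The calling code is identical for both backends; only the constructor argument changes, which is roughly the property that a shared API across a core operator and a provider-specific one would need.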

-a


Re: Generic Transfer Operator

Posted by Tomasz Urbaszek <tu...@apache.org>.
Thanks, Ash, for pointing to https://pypi.org/project/smart-open/. This
one looks really interesting for blob storage transfers!
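
To make the "split one big file into smaller parts" case concrete, here is a minimal sketch of a streaming split over file-like objects. It uses only the standard library; iter_part_chunks and split_to_memory are hypothetical helper names, and io.BytesIO stands in for a remote blob. The assumption is that a library such as smart_open would supply the real file-like handles, since its open() returns file-like objects, but nothing below calls it.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_part_chunks(
    src: BinaryIO, part_size: int, chunk_size: int = 1024 * 1024
) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_index, chunk) pairs so that each part holds at most part_size bytes."""
    if part_size <= 0 or chunk_size <= 0:
        raise ValueError("part_size and chunk_size must be positive")
    part_index = 0
    written_in_part = 0
    while True:
        # Never read past the end of the current part.
        chunk = src.read(min(chunk_size, part_size - written_in_part))
        if not chunk:
            return
        yield part_index, chunk
        written_in_part += len(chunk)
        if written_in_part == part_size:
            # The current part is full; the next chunk starts a new one.
            part_index += 1
            written_in_part = 0


def split_to_memory(src: BinaryIO, part_size: int, chunk_size: int) -> Dict[int, bytes]:
    """Collect the parts in memory; a real transfer would write each chunk to a destination."""
    parts: Dict[int, bytearray] = {}
    for index, chunk in iter_part_chunks(src, part_size, chunk_size):
        parts.setdefault(index, bytearray()).extend(chunk)
    return {index: bytes(data) for index, data in parts.items()}


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
parts = split_to_memory(io.BytesIO(payload), part_size=1000, chunk_size=300)
part_lengths = [len(parts[i]) for i in sorted(parts)]
```

Only one chunk is held in memory at a time inside iter_part_chunks, so the size of the source does not matter to the worker; the in-memory dictionary is there just to keep the example self-contained.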

As stated in the initial design doc, I don't think we should focus on
best performance but rather on versatility. Currently, we have many
AtoB operators that do not yield the highest performance but do their
work and are widely used.

I would say that we should prepare an AIP that will propose two
approaches: generic vs beam. This will allow us to compare them and
then we can vote on which one is better from the Airflow community
perspective.

What do you think?

Tomek


On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> For background: in the past I had an S3 to S3 transfer using smartopen (since we wanted to split one giant ~300GB file into smaller parts) and it took about 10mins, so even "large" uses can work fine in Airflow - no JVM required.
>
> -ash
>
> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <tu...@apache.org> wrote:
> >I think using direct runner as default with the option to specify
> >other setup is a win-win. However, there are few doubts I have about
> >Beam based approach:
> >
> >1. Dependency management. If I do `pip install apache-airflow[gcp]`
> >will it install `apache-beam[gcp]`? What if there's a version clash
> >between dependencies?
> >
> >2. The initial approach using `DataSource` concept allowed users to
> >use it in any operator (not only transfer ones). In case of relying on
> >Beam we are losing this.
> >
> >3. I'm not a Beam expert but it seems to not support any data lineage
> >solution?
> >
> >
> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
> ><da...@gmail.com> wrote:
> >>
> >> I think there are absolutely use-cases for both. I’m totally fine
> >with saying “for small/medium use-cases, we come with an in-house
> >system. However, for larger cases, you’ll require spark/Flink/S3.” That’s
> >totally in line with PLENTY of use-cases. This would be especially cool
> >when matched with fast-follow as we could EVEN potentially tie in data
> >locality.
> >>
> >> via Newton Mail
> >[https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2]
> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
> ><wh...@gmail.com> wrote:
> >> I believe - for not large data - the direct runner is wholly doable,
> >which
> >> seems in line with airflow patterns. I have, and have spoken with
> >several
> >> others that have, been productive with that runner.
> >>
> >> For much larger transfers, the generic operator could accept
> >parameters for
> >> submitting the compute to an actual runner. Though, imagining that
> >> (needing a runner) would not be the primary use case for such an
> >operator.
> >>
> >>
> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <tu...@apache.org>
> >wrote:
> >>
> >> > Austin, you are right, Beam covers all (and more) important IOs.
> >> > However, using Apache Beam to design a generic transfer operator
> >> > requires Airflow users to have additional resources that will be
> >used
> >> > as a runner (Spark, Flink, etc.). Unless you suggest using
> >> > DirectRunner?
> >> >
> >> > Can you please tell us more how exactly you think we can use Beam
> >for
> >> > those Airflow transfer operators?
> >> >
> >> > Best,
> >> > Tomek
> >> >
> >> >
> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> >> > <wh...@gmail.com> wrote:
> >> > >
> >> > > Are there IOs that would be desired for a generic transfer
> >operator that
> >> > > don't exist in:
> >https://beam.apache.org/documentation/io/built-in/ <-
> >> > > there is pretty solid coverage?
> >> > >
> >> > > Beam is getting to the point where even python beam can leverage
> >the java
> >> > > IOs, which increases the range of IOs (and performance).
> >> > >
> >> > >
> >> > >
> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
> ><Ja...@polidea.com>
> >> > > wrote:
> >> > >
> >> > > > But I believe those two ideas are separate ones as Tomek
> >explained :)
> >> > > >
> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
> ><Jarek.Potiuk@polidea.com
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > I love the idea of connecting the projects more closely!
> >> > > > >
> >> > > > > I've been helping recently as a consultant in improving the
> >Apache
> >> > Beam
> >> > > > > build infrastructure (in many parts based on my Airflow
> >experience
> >> > and
> >> > > > > Github Actions - even recently they adopted the "cancel"
> >action I
> >> > > > developed
> >> > > > > for Apache Airflow).
> >https://github.com/apache/beam/pull/12729
> >> > > > >
> >> > > > > Synergies in Apache projects are cool.
> >> > > > >
> >> > > > > J.
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> >> > > > > <gc...@twitter.com.invalid> wrote:
> >> > > > >
> >> > > > >> Agree on keeping those separate, just intervened as I
> >believe it's a
> >> > > > great
> >> > > > >> idea. But let's keep @beam and @spark to a separate thread.
> >> > > > >>
> >> > > > >>
> >> > > > >> Gerard Casas Saez
> >> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> >> > > > >>
> >> > > > >>
> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
> >> > turbaszek@apache.org>
> >> > > > >> wrote:
> >> > > > >>
> >> > > > >> > Daniel is right we have few Apache Beam committers in
> >Polidea so
> >> > we
> >> > > > >> > will ask for advice. However, I would be highly in favor
> >of
> >> > having it
> >> > > > >> > as Gerard suggested as @beam decorator. This is something
> >we
> >> > should
> >> > > > >> > put into another AIP together with the mentioned @spark
> >decorator.
> >> > > > >> >
> >> > > > >> > Our proposition of transfer operators was mainly to create
> >> > something
> >> > > > >> > Airflow-native that works out of the box and allows us to
> >simplify
> >> > > > >> > read/write from external sources. Thus, it requires no
> >external
> >> > > > >> > dependency other than the library to communicate with the
> >API. In
> >> > the
> >> > > > >> > case of Beam we need more than that I think.
> >> > > > >> >
> >> > > > >> > Additionally, the ideas of Source and Destination play
> >nicely with
> >> > > > >> > data lineage and may bring more interest to this feature
> >of
> >> > Airflow.
> >> > > > >> >
> >> > > > >> > Cheers,
> >> > > > >> > Tomek
> >> > > > >> >
> >> > > > >> >
> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
> ><ka...@gmail.com>
> >> > > > wrote:
> >> > > > >> > >
> >> > > > >> > > Nice. Just a note here, we will need to make sure that
> >those
> >> > > > "Source"
> >> > > > >> and
> >> > > > >> > > "Destination" are serializable.
> >> > > > >> > >
> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
> >> > > > daniel.imberman@gmail.com
> >> > > > >> >
> >> > > > >> > > wrote:
> >> > > > >> > >
> >> > > > >> > > > Interesting! Beam also could potentially allow
> >transfers
> >> > within
> >> > > > >> > Dask/any
> >> > > > >> > > > other system with a java/python SDK? I think @jarek
> >and
> >> > Polidea
> >> > > > do a
> >> > > > >> > lot of
> >> > > > >> > > > work with Beam as well so I’d love their thoughts if
> >this a
> >> > good
> >> > > > >> > use-case.
> >> > > > >> > > >
> >> > > > >> > > > via Newton Mail [
> >> > > > >> > > >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> >
> >https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2
> >> > > > >> > > > ]
> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <
> >> > > > >> > gcasassaez@twitter.com.invalid>
> >> > > > >> > > > wrote:
> >> > > > >> > > > I would be highly in favour of having a generic Beam
> >operator.
> >> > > > >> Similar
> >> > > > >> > > > to @spark_task decorator. Something where you can
> >easily
> >> > define
> >> > > > and
> >> > > > >> > wrap a
> >> > > > >> > > > beam pipeline and convert it to an Airflow operator.
> >> > > > >> > > >
> >> > > > >> > > > Gerard Casas Saez
> >> > > > >> > > > Twitter | Cortex | @casassaez
> ><http://twitter.com/casassaez>
> >> > > > >> > > >
> >> > > > >> > > >
> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
> >> > > > >> > > > whatwouldaustindo@gmail.com>
> >> > > > >> > > > wrote:
> >> > > > >> > > >
> >> > > > >> > > > > Are you guys familiar with Beam
> ><https://beam.apache.org>?
> >> > Esp.
> >> > > > >> if
> >> > > > >> > not
> >> > > > >> > > > > doing transforms, it might be rather straightforward to
> >rely
> >> > on the
> >> > > > >> > > > ecosystem
> >> > > > >> > > > > of connectors in that Apache Project to use as the
> >> > foundations
> >> > > > >> for a
> >> > > > >> > > > > generic transfer operator.
> >> > > > >> > > > >
> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
> >> > > > >> > Jarek.Potiuk@polidea.com>
> >> > > > >> > > > > wrote:
> >> > > > >> > > > >
> >> > > > >> > > > > > +1
> >> > > > >> > > > > >
> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <
> >> > > > >> > > > > > kamil.olszewski@polidea.com>
> >> > > > >> > > > > > wrote:
> >> > > > >> > > > > >
> >> > > > >> > > > > > > Hello all,
> >> > > > >> > > > > > > since there have been no new comments shared in
> >the POC
> >> > doc
> >> > > > >> > > > > > > <
> >> > > > >> > > > > > >
> >> > > > >> > > > > >
> >> > > > >> > > > >
> >> > > > >> > > >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> >
> >https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > > for a couple of days, then I will proceed with
> >creating
> >> > an
> >> > > > AIP
> >> > > > >> > for
> >> > > > >> > > > this
> >> > > > >> > > > > > > feature, if that is ok with everybody.
> >> > > > >> > > > > > > Best regards,
> >> > > > >> > > > > > > Kamil
> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek
> ><
> >> > > > >> > > > turbaszek@apache.org
> >> > > > >> > > > > >
> >> > > > >> > > > > > > wrote:
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > > I like the approach as it itnroduces another
> >> > interesting
> >> > > > >> > operators'
> >> > > > >> > > > > > > > interface standarization. It would be awesome
> >to here
> >> > more
> >> > > > >> > opinions
> >> > > > >> > > > > :)
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > > > Cheers,
> >> > > > >> > > > > > > > Tomek
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <
> >> > > > >> > > > > Jarek.Potiuk@polidea.com
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > > wrote:
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > > > > I like the idea a lot. Similar things have
> >been
> >> > > > discussed
> >> > > > >> > before
> >> > > > >> > > > > but
> >> > > > >> > > > > > > the
> >> > > > >> > > > > > > > > proposal is I think rather pragmatic and
> >solves a
> >> > real
> >> > > > >> > problem
> >> > > > >> > > > (and
> >> > > > >> > > > > > it
> >> > > > >> > > > > > > > does
> >> > > > >> > > > > > > > > not seem to be too complex to implement)
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > There is some discussion about it already in
> >the
> >> > > > document
> >> > > > >> > (please
> >> > > > >> > > > > > > > chime-in
> >> > > > >> > > > > > > > > for those interested) but here a few points
> >why I
> >> > like
> >> > > > it:
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > - performance and optimization is not a
> >focus for
> >> > that.
> >> > > > >> For
> >> > > > >> > > > generic
> >> > > > >> > > > > > > stuff
> >> > > > >> > > > > > > > > it is usually to write "optimal" solution
> >but once
> >> > you
> >> > > > >> admit
> >> > > > >> > you
> >> > > > >> > > > > are
> >> > > > >> > > > > > > not
> >> > > > >> > > > > > > > > going to focus for optimisation, you come
> >with
> >> > simpler
> >> > > > and
> >> > > > >> > easier
> >> > > > >> > > > > to
> >> > > > >> > > > > > > use
> >> > > > >> > > > > > > > > solutions
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > - on the other hand - it uses very
> >"Python'y"
> >> > approach
> >> > > > >> with
> >> > > > >> > using
> >> > > > >> > > > > > > > > Airflow's familiar concepts (connection,
> >transfer)
> >> > and
> >> > > > has
> >> > > > >> > the
> >> > > > >> > > > > > > potential
> >> > > > >> > > > > > > > of
> >> > > > >> > > > > > > > > plugging in into 100s of hooks we have
> >already
> >> > easily -
> >> > > > >> > > > leveraging
> >> > > > >> > > > > > all
> >> > > > >> > > > > > > > the
> >> > > > >> > > > > > > > > "providers" richness of Airflow.
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > - it aims to be easy to do "quick start" -
> >if you
> >> > have a
> >> > > > >> > number
> >> > > > >> > > > of
> >> > > > >> > > > > > > > > different sources/targets and as a data
> >scientist
> >> > you
> >> > > > >> would
> >> > > > >> > like
> >> > > > >> > > > to
> >> > > > >> > > > > > > > quickly
> >> > > > >> > > > > > > > > start transferring data between them - you
> >can do it
> >> > > > >> easily
> >> > > > >> > with
> >> > > > >> > > > > > only
> >> > > > >> > > > > > > > > basic python knowledge and simple DAG
> >structure.
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > - it should be possible to plug it in into
> >our new
> >> > > > >> functional
> >> > > > >> > > > > > approach
> >> > > > >> > > > > > > as
> >> > > > >> > > > > > > > > well as future lineage discussions as it
> >makes
> >> > > > connection
> >> > > > >> > between
> >> > > > >> > > > > > > sources
> >> > > > >> > > > > > > > > and targets
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > - it opens up possibilities of adding simple
> >and
> >> > > > flexible
> >> > > > >> > data
> >> > > > >> > > > > > > > > transformation on-transfer. Not a
> >replacement for
> >> > any of
> >> > > > >> the
> >> > > > >> > > > > external
> >> > > > >> > > > > > > > > services that Airflow should use (Airflow is
> >an
> >> > > > >> > orchestrator, not
> >> > > > >> > > > > > data
> >> > > > >> > > > > > > > > processing solution) but for the kind of
> >quick-start
> >> > > > >> > scenarios I
> >> > > > >> > > > > > > foresee
> >> > > > >> > > > > > > > it
> >> > > > >> > > > > > > > > might be most useful, being able to apply
> >simple
> >> > data
> >> > > > >> > > > > transformation
> >> > > > >> > > > > > on
> >> > > > >> > > > > > > > the
> >> > > > >> > > > > > > > > fly by data scientist might be a big plus.
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > Suggestion: Panda DataFrame as the format of
> >the
> >> > "data"
> >> > > > >> > component
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > Kamil - you should have access now.
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > J.
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
> >Olszewski <
> >> > > > >> > > > > > > > > kamil.olszewski@polidea.com>
> >> > > > >> > > > > > > > > wrote:
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > > Hello all,
> >> > > > >> > > > > > > > > > in Polidea we have come up with an idea
> >for a
> >> > generic
> >> > > > >> > transfer
> >> > > > >> > > > > > > operator
> >> > > > >> > > > > > > > > > that would be able to transport data
> >between two
> >> > > > >> > destinations
> >> > > > >> > > > of
> >> > > > >> > > > > > > > various
> >> > > > >> > > > > > > > > > types (file, database, storage, etc.) -
> >please
> >> > find
> >> > > > the
> >> > > > >> > link
> >> > > > >> > > > > with a
> >> > > > >> > > > > > > > short
> >> > > > >> > > > > > > > > > doc with POC
> >> > > > >> > > > > > > > > > <
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > >
> >> > > > >> > > > > >
> >> > > > >> > > > >
> >> > > > >> > > >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> >
> >https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
> >> > > > >> > > > > > > > > > >
> >> > > > >> > > > > > > > > > where we can discuss the design initially.
> >Once we
> >> > > > come
> >> > > > >> to
> >> > > > >> > the
> >> > > > >> > > > > > > initial
> >> > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki -
> >can I
> >> > ask
> >> > > > for
> >> > > > >> > > > > permission
> >> > > > >> > > > > > to
> >> > > > >> > > > > > > > do
> >> > > > >> > > > > > > > > so
> >> > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I believe
> >that
> >> > during
> >> > > > the
> >> > > > >> > > > > discussion
> >> > > > >> > > > > > we
> >> > > > >> > > > > > > > > > should definitely aim for this feature to
> >be
> >> > released
> >> > > > >> only
> >> > > > >> > > > after
> >> > > > >> > > > > > > > Airflow
> >> > > > >> > > > > > > > > > 2.0 is out.
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > > > What do you think about this idea? Would
> >you find
> >> > such
> >> > > > >> an
> >> > > > >> > > > > operator
> >> > > > >> > > > > > > > > helpful
> >> > > > >> > > > > > > > > > in your pipelines? Maybe you already use a
> >similar
> >> > > > >> > solution or
> >> > > > >> > > > > know
> >> > > > >> > > > > > > > > > packages that could be used to implement
> >it?
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > > > Best regards,
> >> > > > >> > > > > > > > > > --
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > > > Kamil Olszewski
> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
> >Software
> >> > Engineer
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > > > M: +48 503 361 783
> >> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > > > Unique Tech
> >> > > > >> > > > > > > > > > Check out our projects! <
> >> > > > >> https://www.polidea.com/our-work>
> >> > > > >> > > > > > > > > >
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > --
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > Jarek Potiuk
> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
> >Principal
> >> > Software
> >> > > > >> > Engineer
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> >> > > > >> > > > > > > > > [image: Polidea] <https://www.polidea.com/>
> >> > > > >> > > > > > > > >
> >> > > > >> > > > > > > >
> >> > > > >> > > > > > >
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > --
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > Kamil Olszewski
> >> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software
> >Engineer
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > M: +48 503 361 783
> >> > > > >> > > > > > > E: kamil.olszewski@polidea.com
> >> > > > >> > > > > > >
> >> > > > >> > > > > > > Unique Tech
> >> > > > >> > > > > > > Check out our projects! <
> >> > https://www.polidea.com/our-work>
> >> > > > >> > > > > > >
> >> > > > >> > > > > >
> >> > > > >> > > > > >
> >> > > > >> > > > > > --
> >> > > > >> > > > > >
> >> > > > >> > > > > > Jarek Potiuk
> >> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal
> >Software
> >> > > > >> Engineer
> >> > > > >> > > > > >
> >> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> >> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> >> > > > >> > > > > >
> >> > > > >> > > > >
> >> > > > >> >
> >> > > > >> >
> >> > > > >> >
> >> > > > >> > --
> >> > > > >> >
> >> > > > >> > Tomasz Urbaszek
> >> > > > >> > Polidea | Software Engineer
> >> > > > >> >
> >> > > > >> > M: +48 505 628 493
> >> > > > >> > E: tomasz.urbaszek@polidea.com
> >> > > > >> >
> >> > > > >> > Unique Tech
> >> > > > >> > Check out our projects!
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > >
> >> > > > > Jarek Potiuk
> >> > > > > Polidea <https://www.polidea.com/> | Principal Software
> >Engineer
> >> > > > >
> >> > > > > M: +48 660 796 129 <+48660796129>
> >> > > > > [image: Polidea] <https://www.polidea.com/>
> >> > > > >
> >> > > > >
> >> > > >
> >> > > > --
> >> > > >
> >> > > > Jarek Potiuk
> >> > > > Polidea <https://www.polidea.com/> | Principal Software
> >Engineer
> >> > > >
> >> > > > M: +48 660 796 129 <+48660796129>
> >> > > > [image: Polidea] <https://www.polidea.com/>
> >> > > >
> >> >

Re: Generic Transfer Operator

Posted by Ash Berlin-Taylor <as...@apache.org>.
For background: in the past I had an S3-to-S3 transfer using smart_open (since we wanted to split one giant ~300GB file into smaller parts) and it took about 10 minutes, so even "large" uses can work fine in Airflow - no JVM required.

-ash
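
As an illustration of the kind of streaming split described above, here is a minimal sketch in plain Python. It is not the code from that transfer: the function name `split_stream` and the `open_part` callback are hypothetical, and it works on any binary file-like objects. With smart_open the same loop would presumably read from and write to S3 URIs, but that is an assumption. The point is only that one chunk is held in memory at a time, so file size is not the limiting factor.

```python
from typing import BinaryIO, Callable


def split_stream(
    source: BinaryIO,
    open_part: Callable[[int], BinaryIO],
    part_size: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy `source` into consecutive parts of at most `part_size` bytes.

    `open_part(i)` is a hypothetical callback returning a writable binary
    object for part number i. Returns the number of parts written.
    """
    if part_size <= 0 or chunk_size <= 0:
        raise ValueError("part_size and chunk_size must be positive")

    part_index = 0
    part = None
    written = 0  # bytes written to the currently open part
    while True:
        # Never read more than what still fits into the current part.
        data = source.read(min(chunk_size, part_size - written))
        if not data:
            break
        if part is None:
            part = open_part(part_index)
        part.write(data)
        written += len(data)
        if written >= part_size:
            # Current part is full: close it and start a new one lazily.
            part.close()
            part = None
            written = 0
            part_index += 1
    if part is not None:
        # Close the last, partially filled part.
        part.close()
        part_index += 1
    return part_index
```

Opening each part lazily means no empty trailing part is created when the input length is an exact multiple of `part_size`.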

On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <tu...@apache.org> wrote:
>I think using the direct runner as the default, with the option to specify
>another setup, is a win-win. However, I have a few doubts about the
>Beam-based approach:
>
>1. Dependency management. If I do `pip install apache-airflow[gcp]`
>will it install `apache-beam[gcp]`? What if there's a version clash
>between dependencies?
>
>2. The initial approach using the `DataSource` concept allowed users to
>use it in any operator (not only transfer ones). If we rely on
>Beam, we lose this.
>
>3. I'm not a Beam expert, but it does not seem to support any data lineage
>solution?

Re: Generic Transfer Operator

Posted by Tomasz Urbaszek <tu...@apache.org>.
I think using the direct runner as the default, with the option to specify
another setup, is a win-win. However, I have a few doubts about the
Beam-based approach:

1. Dependency management. If I do `pip install apache-airflow[gcp]`
will it install `apache-beam[gcp]`? What if there's a version clash
between dependencies?

2. The initial approach using the `DataSource` concept allowed users to
use it in any operator (not only transfer ones). If we rely on
Beam, we lose this.

3. I'm not a Beam expert, but it does not seem to support any data lineage solution?
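
To make point 2 concrete, here is a rough sketch of what a source/destination abstraction that is independent of any particular operator might look like. None of these names come from the POC document or from Airflow itself: `DataSource`, `DataDestination`, the JSON Lines classes, `transfer` and `to_config` are all hypothetical and only illustrate the idea. Using plain dataclasses also keeps the objects trivially serializable, which was raised earlier in the thread as a requirement.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


class DataSource(ABC):
    """Hypothetical interface: something rows can be read from."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        ...


class DataDestination(ABC):
    """Hypothetical interface: something rows can be written to."""

    @abstractmethod
    def write(self, rows: Iterable[dict]) -> int:
        ...


@dataclass
class JsonLinesSource(DataSource):
    path: str

    def read(self) -> Iterator[dict]:
        # Yield one dict per non-empty line; rows are streamed, not loaded at once.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


@dataclass
class JsonLinesDestination(DataDestination):
    path: str

    def write(self, rows: Iterable[dict]) -> int:
        # Write each row as one JSON line and return how many were written.
        count = 0
        with open(self.path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
                count += 1
        return count


def transfer(source: DataSource, destination: DataDestination) -> int:
    """Generic transfer: works for any source/destination pair."""
    return destination.write(source.read())


def to_config(obj) -> dict:
    """Serialize a dataclass-based source or destination to a plain dict."""
    return {"type": type(obj).__name__, **asdict(obj)}
```

Because `read()` and `write()` do not know about each other, the same `JsonLinesSource` could be handed to any other task that only needs to read, not just to a transfer.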


On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
<da...@gmail.com> wrote:
>
> I think there are absolutely use-cases for both. I’m totally fine with saying “for small/medium use-cases, we come with an in-house system. However, for larger cases, you’ll require Spark/Flink/S3.” That’s totally in line with PLENTY of use-cases. This would be especially cool when matched with fast-follow as we could EVEN potentially tie in data locality.
>
> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <wh...@gmail.com> wrote:
> I believe - for not large data - the direct runner is wholly doable, which
> seems in line with airflow patterns. I have, and have spoken with several
> others that have, been productive with that runner.
>
> For much larger transfers, the generic operator could accept parameters for
> submitting the compute to an actual runner. Though I imagine that
> (needing a runner) would not be the primary use case for such an operator.
>
>
> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <tu...@apache.org> wrote:
>
> > Austin, you are right, Beam covers all (and more) important IOs.
> > However, using Apache Beam to design a generic transfer operator
> > requires Airflow users to have additional resources that will be used
> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > DirectRunner?
> >
> > Can you please tell us more about how exactly you think we can use Beam for
> > those Airflow transfer operators?
> >
> > Best,
> > Tomek
> >
> >
> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > <wh...@gmail.com> wrote:
> > >
> > > Are there IOs that would be desired for a generic transfer operator that
> > > don't exist in: https://beam.apache.org/documentation/io/built-in/ <-
> > > there is pretty solid coverage?
> > >
> > > Beam is getting to the point where even python beam can leverage the java
> > > IOs, which increases the range of IOs (and performance).
> > >
> > >
> > >
> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <Ja...@polidea.com>
> > > wrote:
> > >
> > > > But I believe those two ideas are separate ones as Tomek explained :)
> > > >
> > > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > > > > wrote:
> > > >
> > > > > I love the idea of connecting the projects more closely!
> > > > >
> > > > > > I've been helping recently as a consultant in improving the Apache Beam
> > > > > > build infrastructure (in many parts based on my Airflow experience and
> > > > > > Github Actions - even recently they adopted the "cancel" action I developed
> > > > > > for Apache Airflow). https://github.com/apache/beam/pull/12729
> > > > >
> > > > > Synergies in Apache projects are cool.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > > > > <gc...@twitter.com.invalid> wrote:
> > > > >
> > > > > >> Agree on keeping those separate, just intervened as I believe it's a great
> > > > > >> idea. But let's keep @beam and @spark to a separate thread.
> > > > >>
> > > > >>
> > > > >> Gerard Casas Saez
> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > >>
> > > > >>
> > > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <turbaszek@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Daniel is right, we have a few Apache Beam committers at Polidea, so we
> > > > > >> > will ask for advice. However, I would be highly in favor of having it,
> > > > > >> > as Gerard suggested, as a @beam decorator. This is something we should
> > > > > >> > put into another AIP together with the mentioned @spark decorator.
> > > > > >> >
> > > > > >> > Our proposition of transfer operators was mainly to create something
> > > > > >> > Airflow-native that works out of the box and allows us to simplify
> > > > > >> > read/write from external sources. Thus, it requires no external
> > > > > >> > dependency other than the library to communicate with the API. In the
> > > > > >> > case of Beam we need more than that, I think.
> > > > > >> >
> > > > > >> > Additionally, the ideas of Source and Destination play nicely with
> > > > > >> > data lineage and may bring more interest to this feature of Airflow.
> > > > >> >
> > > > >> > Cheers,
> > > > >> > Tomek
> > > > >> >
> > > > >> >
> > > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <ka...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > Nice. Just a note here, we will need to make sure that those "Source" and
> > > > > >> > > "Destination" are serializable.
> > > > >> > >
> > > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <daniel.imberman@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Interesting! Beam also could potentially allow transfers within Dask/any
> > > > > >> > > > other system with a Java/Python SDK? I think @jarek and Polidea do a lot of
> > > > > >> > > > work with Beam as well, so I’d love their thoughts on whether this is a
> > > > > >> > > > good use-case.
> > > > >> > > >
> > > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <gcasassaez@twitter.com.invalid>
> > > > > >> > > > wrote:
> > > > > >> > > > I would be highly in favour of having a generic Beam operator. Similar
> > > > > >> > > > to @spark_task decorator. Something where you can easily define and wrap a
> > > > > >> > > > beam pipeline and convert it to an Airflow operator.
> > > > >> > > >
> > > > >> > > > Gerard Casas Saez
> > > > >> > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > >> > > >
> > > > >> > > >
> > > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <whatwouldaustindo@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not
> > > > > >> > > > > doing transforms, it might be rather straightforward to rely on the
> > > > > >> > > > > ecosystem of connectors in that Apache Project to use as the foundations
> > > > > >> > > > > for a generic transfer operator.
> > > > >> > > > >
> > > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > +1
> > > > > >> > > > > >
> > > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <kamil.olszewski@polidea.com>
> > > > > >> > > > > > wrote:
> > > > >> > > > > >
> > > > > >> > > > > > > Hello all,
> > > > > >> > > > > > > since there have been no new comments shared in the POC doc
> > > > > >> > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> > > > > >> > > > > > > for a couple of days, I will proceed with creating an AIP for this
> > > > > >> > > > > > > feature, if that is ok with everybody.
> > > > >> > > > > > > Best regards,
> > > > >> > > > > > > Kamil
> > > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <turbaszek@apache.org>
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > I like the approach as it introduces another interesting
> > > > > >> > > > > > > > standardization of the operators' interface. It would be awesome to
> > > > > >> > > > > > > > hear more opinions :)
> > > > >> > > > > > > >
> > > > >> > > > > > > > Cheers,
> > > > >> > > > > > > > Tomek
> > > > >> > > > > > > >
> > > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > > > > >> > > > > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > I like the idea a lot. Similar things have been discussed before but the
> > > > > >> > > > > > > > > proposal is I think rather pragmatic and solves a real problem (and it does
> > > > > >> > > > > > > > > not seem to be too complex to implement)
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > There is some discussion about it already in the document (please chime-in
> > > > > >> > > > > > > > > for those interested) but here are a few points why I like it:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - performance and optimization is not a focus for that. For generic stuff
> > > > > >> > > > > > > > > it is usually hard to write an "optimal" solution but once you admit you are
> > > > > >> > > > > > > > > not going to focus on optimisation, you come up with simpler and easier to
> > > > > >> > > > > > > > > use solutions
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - on the other hand - it uses a very "Python'y" approach with using
> > > > > >> > > > > > > > > Airflow's familiar concepts (connection, transfer) and has the potential of
> > > > > >> > > > > > > > > plugging in into 100s of hooks we have already easily - leveraging all the
> > > > > >> > > > > > > > > "providers" richness of Airflow.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it aims to be easy to do "quick start" - if you have a number of
> > > > > >> > > > > > > > > different sources/targets and as a data scientist you would like to quickly
> > > > > >> > > > > > > > > start transferring data between them - you can do it easily with only
> > > > > >> > > > > > > > > basic python knowledge and simple DAG structure.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it should be possible to plug it in into our new functional approach as
> > > > > >> > > > > > > > > well as future lineage discussions as it makes a connection between sources
> > > > > >> > > > > > > > > and targets
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it opens up possibilities of adding simple and flexible data
> > > > > >> > > > > > > > > transformation on-transfer. Not a replacement for any of the external
> > > > > >> > > > > > > > > services that Airflow should use (Airflow is an orchestrator, not a data
> > > > > >> > > > > > > > > processing solution) but for the kind of quick-start scenarios I foresee it
> > > > > >> > > > > > > > > might be most useful, being able to apply simple data transformation on the
> > > > > >> > > > > > > > > fly by a data scientist might be a big plus.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data" component
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Kamil - you should have access now.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > J.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <kamil.olszewski@polidea.com>
> > > > > >> > > > > > > > > wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Hello all,
> > > > > >> > > > > > > > > > in Polidea we have come up with an idea for a generic transfer operator
> > > > > >> > > > > > > > > > that would be able to transport data between two destinations of various
> > > > > >> > > > > > > > > > types (file, database, storage, etc.) - please find the link with a short
> > > > > >> > > > > > > > > > doc with POC
> > > > > >> > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
> > > > > >> > > > > > > > > > where we can discuss the design initially. Once we come to the initial
> > > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki - can I ask for permission to do so
> > > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I believe that during the discussion we
> > > > > >> > > > > > > > > > should definitely aim for this feature to be released only after Airflow
> > > > > >> > > > > > > > > > 2.0 is out.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > What do you think about this idea? Would you find such an operator helpful
> > > > > >> > > > > > > > > > in your pipelines? Maybe you already use a similar solution or know
> > > > > >> > > > > > > > > > packages that could be used to implement it?
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Best regards,
> > > > >> > > > > > > > > > --
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Kamil Olszewski
> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> | Software
> > Engineer
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > M: +48 503 361 783
> > > > >> > > > > > > > > > E: kamil.olszewski@polidea.com
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Unique Tech
> > > > >> > > > > > > > > > Check out our projects! <
> > > > >> https://www.polidea.com/our-work>
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Jarek Potiuk
> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> | Principal
> > Software
> > > > >> > Engineer
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > > > >> > > > > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > --
> > > > >> > > > > > >
> > > > >> > > > > > > Kamil Olszewski
> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software Engineer
> > > > >> > > > > > >
> > > > >> > > > > > > M: +48 503 361 783
> > > > >> > > > > > > E: kamil.olszewski@polidea.com
> > > > >> > > > > > >
> > > > >> > > > > > > Unique Tech
> > > > >> > > > > > > Check out our projects! <
> > https://www.polidea.com/our-work>
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > >
> > > > >> > > > > > Jarek Potiuk
> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal Software
> > > > >> Engineer
> > > > >> > > > > >
> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > Tomasz Urbaszek
> > > > >> > Polidea | Software Engineer
> > > > >> >
> > > > >> > M: +48 505 628 493
> > > > >> > E: tomasz.urbaszek@polidea.com
> > > > >> >
> > > > >> > Unique Tech
> > > > >> > Check out our projects!
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >
> > > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> >

Re: Generic Transfer Operator

Posted by Daniel Imberman <da...@gmail.com>.
I think there are absolutely use-cases for both. I’m totally fine with saying “for small/medium use-cases, we come with an in-house system. However, for larger cases, you’ll require Spark/Flink/S3.” That’s totally in line with PLENTY of use-cases. This would be especially cool when matched with fast-follow, as we could EVEN potentially tie in data locality.

On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <wh...@gmail.com> wrote:
I believe that, for data that is not large, the direct runner is wholly doable, which
seems in line with Airflow patterns. I have been productive with that runner, and
have spoken with several others who have been as well.

For much larger transfers, the generic operator could accept parameters for
submitting the compute to an actual runner. Though I imagine that case
(needing a runner) would not be the primary use case for such an operator.



Re: Generic Transfer Operator

Posted by Austin Bennett <wh...@gmail.com>.
I believe that, for data that is not large, the direct runner is wholly doable, which
seems in line with Airflow patterns. I have been productive with that runner, and
have spoken with several others who have been as well.

For much larger transfers, the generic operator could accept parameters for
submitting the compute to an actual runner. Though I imagine that case
(needing a runner) would not be the primary use case for such an operator.
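
A minimal sketch of that idea, not an actual Airflow or Beam API: the helper name `build_runner_flags` and its parameters are hypothetical. It only shows how a generic operator might default to the DirectRunner and let the user opt in to a real runner through parameters. The resulting list is in the `--name=value` form that Beam pipeline options are normally given in; `flink_master` is used here purely as an example option.

```python
from typing import Dict, List, Optional


def build_runner_flags(
    runner: str = "DirectRunner",
    runner_options: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Turn operator parameters into Beam-style pipeline flags.

    Hypothetical helper for illustration only; it is not part of
    Airflow or Apache Beam.
    """
    flags = [f"--runner={runner}"]
    # Sort the options so the generated flags are deterministic.
    for name, value in sorted((runner_options or {}).items()):
        flags.append(f"--{name}={value}")
    return flags


# Small transfer: run in-process, no extra infrastructure needed.
print(build_runner_flags())
# ['--runner=DirectRunner']

# Larger transfer: same operator, but the compute is sent to a real runner.
print(build_runner_flags("FlinkRunner", {"flink_master": "localhost:8081"}))
# ['--runner=FlinkRunner', '--flink_master=localhost:8081']
```

With a default like this, the small-data case needs no configuration at all, and the runner only shows up for users who actually need one.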


On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <tu...@apache.org> wrote:

> Austin, you are right, Beam covers all (and more) important IOs.
> However, using Apache Beam to design a generic transfer operator
> requires Airflow users to have additional resources that will be used
> as a runner (Spark, Flink, etc.). Unless you suggest using
> DirectRunner?
>
> Can you please tell us more how exactly you think we can use Beam for
> those Airflow transfer operators?
>
> Best,
> Tomek

Re: Generic Transfer Operator

Posted by Tomasz Urbaszek <tu...@apache.org>.
Austin, you are right, Beam covers all the important IOs (and more).
However, using Apache Beam to design a generic transfer operator
requires Airflow users to have additional resources that will be used
as a runner (Spark, Flink, etc.), unless you are suggesting the
DirectRunner?

Can you please tell us more about how exactly you think we can use Beam
for those Airflow transfer operators?

Best,
Tomek


On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
<wh...@gmail.com> wrote:
>
> Are there IOs that would be desired for a generic transfer operator that
> don't exist in:  https://beam.apache.org/documentation/io/built-in/ <-
> there is pretty solid coverage?
>
> Beam is getting to the point where even python beam can leverage the java
> IOs, which increases the range of IOs (and performance).

Re: Generic Transfer Operator

Posted by Austin Bennett <wh...@gmail.com>.
Are there IOs that would be desired for a generic transfer operator that
are missing from the built-in list at
https://beam.apache.org/documentation/io/built-in/ ? The coverage there is
pretty solid.

Beam is getting to the point where even the Python SDK can leverage the Java
IOs, which increases the range of IOs (and performance).



On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> But I believe those two ideas are separate ones as Tomek explained :)
>
> On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
> > I love the idea of connecting the projects more closely!
> >
> > I've been helping recently as a consultant in improving the Apache Beam
> > build infrastructure (in many parts based on my Airflow experience and
> > Github Actions - even recently they adopted the "cancel" action I
> developed
> > for Apache Airflow). https://github.com/apache/beam/pull/12729
> >
> > Synergies in Apache projects are cool.
> >
> > J.
> >
> >
> > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > <gc...@twitter.com.invalid> wrote:
> >
> >> Agree on keeping those separate, just intervened as I believe its a
> great
> >> idea. But lets keep @beam and @spark to a separate thread.
> >>
> >>
> >> Gerard Casas Saez
> >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> >>
> >>
> >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <tu...@apache.org>
> >> wrote:
> >>
> >> > Daniel is right we have few Apache Beam committers in Polidea so we
> >> > will ask for advice. However, I would be highly in favor of having it
> >> > as Gerard suggested as @beam decorator. This is something we should
> >> > put into another AIP together with the mentioned @spark decorator.
> >> >
> >> > Our proposition of transfer operators was mainly to create something
> >> > Airflow-native that works out of the box and allows us to simplify
> >> > read/write from external sources. Thus, it requires no external
> >> > dependency other than the library to communicate with the API. In the
> >> > case of Beam we need more than that I think.
> >> >
> >> > Additionally, the ideas of Source and Destination play nicely with
> >> > data lineage and may bring more interest to this feature of Airflow.
> >> >
> >> > Cheers,
> >> > Tomek
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > Tomasz Urbaszek
> >> > Polidea | Software Engineer
> >> >
> >> > M: +48 505 628 493
> >> > E: tomasz.urbaszek@polidea.com
> >> >
> >> > Unique Tech
> >> > Check out our projects!
> >> >
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> >
> >
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>

Re: Generic Transfer Operator

Posted by Jarek Potiuk <Ja...@polidea.com>.
But I believe those two ideas are separate ones as Tomek explained :)

On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> I love the idea of connecting the projects more closely!
>
> I've recently been helping, as a consultant, to improve the Apache Beam
> build infrastructure (in large part based on my Airflow experience and
> GitHub Actions; they even recently adopted the "cancel" action I developed
> for Apache Airflow): https://github.com/apache/beam/pull/12729
>
> Synergies in Apache projects are cool.
>
> J.
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: Generic Transfer Operator

Posted by Jarek Potiuk <Ja...@polidea.com>.
I love the idea of connecting the projects more closely!

I've recently been helping, as a consultant, to improve the Apache Beam
build infrastructure (in large part based on my Airflow experience and
GitHub Actions; they even recently adopted the "cancel" action I developed
for Apache Airflow): https://github.com/apache/beam/pull/12729

Synergies in Apache projects are cool.

J.


On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
<gc...@twitter.com.invalid> wrote:

> Agree on keeping those separate; I just intervened because I believe it's a
> great idea. But let's keep @beam and @spark to a separate thread.
>
>
> Gerard Casas Saez
> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: Generic Transfer Operator

Posted by Gerard Casas Saez <gc...@twitter.com.INVALID>.
Agreed on keeping those separate; I only chimed in because I believe it's a
great idea. But let's keep @beam and @spark in a separate thread.


Gerard Casas Saez
Twitter | Cortex | @casassaez <http://twitter.com/casassaez>



Re: Generic Transfer Operator

Posted by Tomasz Urbaszek <tu...@apache.org>.
Daniel is right: we have a few Apache Beam committers at Polidea, so we
will ask them for advice. However, I would be highly in favor of having it,
as Gerard suggested, as a @beam decorator. This is something we should
put into another AIP together with the mentioned @spark decorator.

Our proposal for transfer operators was mainly to create something
Airflow-native that works out of the box and simplifies reading from
and writing to external sources. Thus, it requires no external
dependency other than the library needed to communicate with the given
API. In the case of Beam, I think we would need more than that.

Additionally, the ideas of Source and Destination play nicely with
data lineage and may bring more interest to this feature of Airflow.
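
To make the Source/Destination idea a bit more concrete, here is a minimal
sketch in plain Python. All names in it (Source, Destination, InMemorySource,
InMemoryDestination, transfer) are hypothetical; they are not taken from the
POC doc or from Airflow. It only illustrates that a transfer can be written
against two small interfaces and can report which source fed which
destination, which is the kind of information lineage would need.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


class Source(ABC):
    """Hypothetical interface: anything that can produce records."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def read(self) -> Iterable[Record]:
        """Yield records from the underlying system."""


class Destination(ABC):
    """Hypothetical interface: anything that can accept records."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        """Store records and return how many were written."""


class InMemorySource(Source):
    """Toy source backed by a list, standing in for a file or a table."""

    def __init__(self, name: str, rows: List[Record]) -> None:
        super().__init__(name)
        self.rows = rows

    def read(self) -> Iterable[Record]:
        return iter(self.rows)


class InMemoryDestination(Destination):
    """Toy destination that collects records in a list."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def transfer(source: Source, destination: Destination) -> Dict[str, Any]:
    """Copy all records and return a small lineage-style summary."""
    written = destination.write(source.read())
    return {"inlet": source.name, "outlet": destination.name, "records": written}
```

Because transfer() only depends on the two interfaces, adding a new system
would mean writing one Source or one Destination rather than one operator per
pair of systems.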

Cheers,
Tomek





-- 

Tomasz Urbaszek
Polidea | Software Engineer

M: +48 505 628 493
E: tomasz.urbaszek@polidea.com

Unique Tech
Check out our projects!

Re: Generic Transfer Operator

Posted by Kaxil Naik <ka...@gmail.com>.
Nice. Just a note here: we will need to make sure that "Source" and
"Destination" are serializable.

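As a rough illustration of what "serializable" could mean here, the sketch
below describes a source with plain fields only (a connection id rather than
an open connection) and round-trips it through JSON. The SourceSpec class, its
fields, and the choice of JSON are assumptions made for this example; the
thread does not specify a format or a class layout.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical declarative description of a source (no live handles)."""

    kind: str      # e.g. "file" or "database"; illustrative values only
    conn_id: str   # reference to a connection, not the connection object
    path: str      # location within the source system


def serialize(spec: SourceSpec) -> str:
    """Turn the spec into a JSON string with a stable key order."""
    return json.dumps(asdict(spec), sort_keys=True)


def deserialize(payload: str) -> SourceSpec:
    """Rebuild the spec from the JSON string produced by serialize()."""
    return SourceSpec(**json.loads(payload))
```

Keeping only primitive fields is what makes the round trip trivial; anything
holding sockets, file handles, or client objects would need to be created
lazily at execution time instead.
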

Re: Generic Transfer Operator

Posted by Daniel Imberman <da...@gmail.com>.
Interesting! Could Beam also potentially allow transfers within Dask or any other system with a Java/Python SDK? I think @jarek and Polidea do a lot of work with Beam as well, so I’d love their thoughts on whether this is a good use case.


Re: Generic Transfer Operator

Posted by Gerard Casas Saez <gc...@twitter.com.INVALID>.
I would be highly in favour of having a generic Beam operator, similar to
the @spark_task decorator: something where you can easily define and wrap a
Beam pipeline and convert it to an Airflow operator.
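
As a rough illustration of that idea, here is a minimal sketch in plain
Python. It uses stand-in classes only, with no real Airflow or Beam imports.
The names `beam_task`, `PipelineOperator` and `execute` are hypothetical and
are not an existing API; a real version would presumably subclass Airflow's
operator base class and run an actual Beam pipeline inside `execute`.

```python
from typing import Any, Callable, Dict


class PipelineOperator:
    """Stand-in for an Airflow operator (hypothetical, not the real base class)."""

    def __init__(self, task_id: str, pipeline_fn: Callable[..., Any], **kwargs: Any) -> None:
        self.task_id = task_id
        self.pipeline_fn = pipeline_fn
        self.kwargs = kwargs

    def execute(self, context: Dict[str, Any]) -> Any:
        # A real implementation would build and run a Beam pipeline here;
        # this sketch simply calls the wrapped function with its arguments.
        return self.pipeline_fn(**self.kwargs)


def beam_task(fn: Callable[..., Any]) -> Callable[..., PipelineOperator]:
    """Hypothetical decorator: calling the decorated function yields an operator."""

    def make_operator(**kwargs: Any) -> PipelineOperator:
        # The task id is taken from the function name, as an assumption.
        return PipelineOperator(task_id=fn.__name__, pipeline_fn=fn, **kwargs)

    return make_operator


@beam_task
def word_lengths(words):
    # Placeholder for the pipeline body (for example a chain of Beam transforms).
    return [len(w) for w in words]


# Calling the decorated function creates the operator; nothing runs yet.
op = word_lengths(words=["airflow", "beam"])
# The scheduler would call execute(); here it is called by hand.
result = op.execute(context={})
```

The point of the decorator form is that the pipeline author writes an
ordinary function and the wrapping into a task happens in one place.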

Gerard Casas Saez
Twitter | Cortex | @casassaez <http://twitter.com/casassaez>


On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <wh...@gmail.com>
wrote:

> Are you guys familiar with Beam <https://beam.apache.org>?  Esp. if not
> doing transforms, it might be rather straightforward to rely on the
> ecosystem of connectors in that Apache project as the foundation for a
> generic transfer operator.

Re: Generic Transfer Operator

Posted by Austin Bennett <wh...@gmail.com>.
Are you guys familiar with Beam <https://beam.apache.org>?  Esp. if not
doing transforms, it might be rather straightforward to rely on the
ecosystem of connectors in that Apache project as the foundation for a
generic transfer operator.
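
To make the connector-based idea concrete, here is a minimal sketch in plain
Python. `Source`, `Destination`, `transfer` and the in-memory classes are
hypothetical names used only for illustration. A real version would delegate
`read` and `write` to existing connectors (Beam I/O or Airflow hooks), which
this sketch does not attempt.

```python
from typing import Callable, Dict, Iterable, List, Optional, Protocol

Row = Dict[str, object]


class Source(Protocol):
    """Anything that can produce rows (hypothetical interface)."""

    def read(self) -> Iterable[Row]: ...


class Destination(Protocol):
    """Anything that can accept rows (hypothetical interface)."""

    def write(self, rows: Iterable[Row]) -> int: ...


class ListSource:
    """In-memory source standing in for a file, database, or bucket."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def read(self) -> Iterable[Row]:
        return iter(self._rows)


class ListDestination:
    """In-memory destination that records what was written."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> int:
        before = len(self.rows)
        self.rows.extend(rows)
        # Return the number of rows written by this call.
        return len(self.rows) - before


def transfer(
    source: Source,
    destination: Destination,
    transform: Optional[Callable[[Row], Row]] = None,
) -> int:
    """Copy rows from source to destination, optionally transforming each row."""
    rows = source.read()
    if transform is not None:
        # Lazily apply the per-row transformation while transferring.
        rows = (transform(row) for row in rows)
    return destination.write(rows)


src = ListSource([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
dst = ListDestination()
count = transfer(src, dst, transform=lambda r: {**r, "name": str(r["name"]).upper()})
```

Because `transfer` only depends on the two small interfaces, any pair of
connectors that implement `read` and `write` could be combined without the
transfer code knowing their types.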

On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> +1

Re: Generic Transfer Operator

Posted by Jarek Potiuk <Ja...@polidea.com>.
+1

On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <ka...@polidea.com>
wrote:

> Hello all,
> since there have been no new comments shared in the POC doc
> <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> for a couple of days, then I will proceed with creating an AIP for this
> feature, if that is ok with everybody.
> Best regards,
> Kamil


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>