Posted to user@atlas.apache.org by Bolke de Bruin <bd...@gmail.com> on 2018/05/01 16:00:56 UTC

Integration with Apache Airflow

Hi There.

I’m one of the maintainers of Apache Airflow (incubating). Airflow is a task-based orchestration (workflow) engine. A long-time wish of our community is to have lineage information available. I have been tinkering with integrating Apache Atlas for a couple of days now, but some questions remain. I hope you can help.

We have a concept called “Operators”, which translates to a Process most of the time. For example, we have an SFTPOperator, which copies a file to a certain destination. We also have a SparkSubmitOperator, and here it becomes a bit gray to me. Basically, what we do there is kick off a Spark script, so from a data perspective you can consider our Operator to have exactly the same inputs and outputs as the Spark script; it also performs the same processing.
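To make the mapping concrete, here is a rough sketch of what I have been tinkering with: registering one SFTPOperator run as a Process entity through the Atlas v2 REST API. The fs_path references, the "dag_id.task_id@execution_date" qualifiedName scheme, and the endpoint/credentials are my own assumptions, not an established convention:

    import json
    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2/entity"  # assumed endpoint

    def file_ref(path, cluster="primary"):
        # Reference an existing fs_path DataSet by its unique attribute.
        # The "path@cluster" scheme is my assumption, not a fixed convention.
        return {"typeName": "fs_path",
                "uniqueAttributes": {"qualifiedName": "{}@{}".format(path, cluster)}}

    # One SFTPOperator run expressed as a Process linking its input to its output.
    entity = {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": "my_dag.sftp_copy@2018-05-01T00:00:00",
                "name": "sftp_copy",
                "inputs": [file_ref("/staging/report.csv")],
                "outputs": [file_ref("/landing/report.csv")],
            },
        }
    }

    resp = requests.post(ATLAS_URL, auth=("admin", "admin"),
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(entity))
    resp.raise_for_status()

My understanding is that re-posting the same qualifiedName would update the existing entity rather than create a duplicate, but that is exactly the kind of convention I would like to have confirmed.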

There are a couple of challenges:

1. In case Spark does emit lineage information (I know that is being worked on), our SparkSubmitOperator basically emits redundant information. However, the Operator cannot know whether Spark emits lineage information, and often it does not (older versions). What is the best course of action? If the Operator does emit lineage information, does that make sense? How will this end up in the Atlas UI?

2. The conventions on what a Process accepts as inputs and supplies as outputs seem to be very loose and dependent on a single type of system (Spark, Flink, whatever). They are also subject to change. So what do I supply as inputs to Spark, and how do I pick up from Spark what it considers its outputs? The qualifiedName for Spark seems to be “ApplicationID.ExecutionId” (at the moment). How do I get that information *outside* of Spark? One attempt at this is sketched after this list.

3. What do you consider the best way to integrate? (Related to #1 and #2.) Do we live in our own ecosystem and define our own Processes, Entities, etc.? Or do we integrate somewhere at the model level?
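On #2, the closest I have come to getting the application ID outside of Spark is scraping it from the spark-submit output, roughly like this (the YARN-style ID pattern is an assumption on my part; other cluster managers print different identifiers):

    import re
    import subprocess

    def submit_and_capture_app_id(cmd):
        # Run spark-submit and scan its combined output for a YARN-style
        # application ID, since spark-submit does not hand it back directly.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        app_id = None
        for line in proc.stdout:
            match = re.search(r"application_\d+_\d+", line)
            if match:
                app_id = match.group(0)
        proc.wait()
        return app_id

    app_id = submit_and_capture_app_id(
        ["spark-submit", "--master", "yarn", "my_job.py"])

This feels fragile, which is why I am asking whether there is a sanctioned way to obtain it.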

I’m the first to admit that I could be misunderstanding many things here, but I hope you’re willing to educate me :).

Regards,

Bolke




Re: Integration with Apache Airflow

Posted by Ernie Ostic <eo...@us.ibm.com>.
Hi Bolke...

I've been looking at similar requirements for lineage for somewhat
"out-of-the-usual-scope" kinds of data flow patterns, as well as generic
ETL. Working with others here on the Apache Atlas team, the momentum
behind the Open Metadata initiative, and OMRS in particular, seems like
the right place to put effort into defining processes, as opposed to
defining your own new types and custom ecosystem. I've yet to dive into
it headfirst, but there is a "Process" area within the Open Metadata
model definitions that would probably be the best place to start. Last I
checked, it isn't yet deeply defined, but these kinds of lineage examples
seem like a good place to begin.
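In the meantime, you can already inspect what the built-in Process type
gives you, straight from a running Atlas instance. A minimal sketch (the
endpoint and default credentials are assumptions about your deployment):

    import requests

    # Fetch the definition of the built-in Process entity type to see
    # which attributes (notably inputs/outputs) it already carries.
    resp = requests.get(
        "http://localhost:21000/api/atlas/v2/types/entitydef/name/Process",
        auth=("admin", "admin"))
    resp.raise_for_status()

    process_def = resp.json()
    print(process_def["superTypes"])           # e.g. ["Asset"]
    for attr in process_def["attributeDefs"]:
        print(attr["name"], attr["typeName"])  # inputs/outputs: array<DataSet>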

Some links to the Apache Atlas Wiki, for reference...

Open Metadata

https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance

Open Metadata, type system

https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem

The models, and where you will see the outline for Processes, are at "Area
5"...

https://cwiki.apache.org/confluence/display/ATLAS/Area+5+-+Standards

Ernie

Ernie Ostic

InfoSphere Information Server
IBM Analytics

Cell: (617) 331 8238
---------------------------------------------------------------
Apache Atlas Update!
https://dsrealtime.wordpress.com/2017/11/16/apache-atlas-update-have-you-been-watching/

Open IGC is here!
https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/


