Posted to user@falcon.apache.org by Ajay Yadav <aj...@gmail.com> on 2015/07/15 10:32:37 UTC

Re: Need help in understanding Lineage With Falcon

+user mailing list (as it might be beneficial for other users also)

Hi Anuj,

Falcon (as of now) doesn't capture column-level lineage. It only stores the
relationships between various feeds and processes (and their instances).
You can query only on those relationships. For example, you can use Falcon to
query information like

   - Who are all the consumers of this data?
   - Who are the downstream consumers (consuming the results of direct
   consumers) of this feed?
   - Which process instances consumed this particular instance of the feed?
   - What are all the processes in this pipeline?
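
To make that concrete, below is a minimal Python sketch of how such
relationship queries could be issued over HTTP. The dependency endpoint
follows the 0.6.1 REST API docs linked later in this thread; the host, port,
feed/pipeline names, and user are made-up placeholders, so treat it as an
illustration rather than a verified client:

    import requests  # third-party HTTP client, assumed to be installed

    # Placeholder values -- substitute your own Falcon server, user, and entity names.
    FALCON = "http://falcon-host:15000"
    PARAMS = {"user.name": "falcon"}
    HEADERS = {"Accept": "application/json"}

    # "Who are all the consumers of this data?" -- entities that depend on a feed.
    deps = requests.get(FALCON + "/api/entities/dependencies/feed/my-feed",
                        params=PARAMS, headers=HEADERS)
    print(deps.json())

    # "What are all the processes in this pipeline?" -- entity lineage for a
    # pipeline (endpoint path assumed from the 0.6.1 EntityLineage doc).
    lineage = requests.get(FALCON + "/api/metadata/lineage",
                           params=dict(PARAMS, pipeline="my-pipeline"),
                           headers=HEADERS)
    print(lineage.json())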

More specifically, for the example that you have provided, Falcon isn't aware
that the Hive script used the FirstName and LastName fields in the file and
used them to produce the FullName field. It only knows that this file is
the input for this Hive script (if you are running it through a Falcon
process) and that these are the output tables.
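
As an illustration of what *is* recorded, the sketch below walks the metadata
graph from a feed-instance vertex to the entities connected to it, using the
Rexster-style endpoints Srikanth mentions below. The vertex name format, the
response shape, and the server details are all assumptions, so adjust them to
your deployment:

    import requests  # third-party HTTP client, assumed to be installed

    FALCON = "http://falcon-host:15000"
    PARAMS = {"user.name": "falcon"}

    # Find the vertex for a feed instance by name. The "feed/instance-time"
    # naming here is a guess; inspect your own graph for the exact convention.
    found = requests.get(FALCON + "/api/metadata/lineage/vertices",
                         params=dict(PARAMS, key="name",
                                     value="my-feed/2015-07-15T00:00Z")).json()

    # Walk edges in both directions from that vertex: the adjacent vertices are
    # feeds/processes and their instances -- there is no column-level detail.
    # The "results"/"_id" fields follow Rexster's response convention and may differ.
    vertex_id = found["results"][0]["_id"]
    near = requests.get(FALCON + "/api/metadata/lineage/vertices/%s/both" % vertex_id,
                        params=PARAMS).json()
    print(near)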

Cheers
Ajay Yadava

On Wed, Jul 15, 2015 at 1:09 PM, anuj kumar <an...@gmail.com> wrote:

> Thanks Srikanth for the quick response.
> I understand that querying the Graph DB may not be a good solution after
> all.
>
> But what I understood is that this Graph DB already stores the calculated
> lineage information. Is this assumption correct?
> If yes, is there also some data store that captures the metadata
> information?
> So, for example, let's assume I have a Hive script that reads a file in HDFS,
> concatenates the FirstName and LastName fields in the file, and stores the
> result into a Hive table as a FullName field.
>
> Now, in order to generate lineage for the FullName field, it needs to know
> not only the HDFS file name and the field name but also the Hive table
> name as well as the column name.
>
> How does Falcon capture this metadata from the Hive script? Does it parse
> the Hive script to understand the metadata? Also, where is this metadata
> stored? Is it in HCatalog?
>
> Maybe I have completely misunderstood it. Please correct me if I am wrong.
>
> Thanks,
> Anuj Kumar
>
> On Wed, Jul 15, 2015 at 7:25 AM, Srikanth Sundarrajan <sriksun@hotmail.com>
> wrote:
>
> > Hi Anuj,
> >     Falcon stores lineage information in a graph store backed by a
> > Blueprints API (by default it is stored in Titan DB). So if one understands
> > the schema, one can query the graph, but we would prefer that no one
> > access these graphs directly, as they are an internal representation of
> > Falcon and are subject to change without any prior notice across
> > releases.
> > Graph-related REST APIs in Falcon are modeled on the Rexster APIs (
> > https://github.com/tinkerpop/rexster/wiki/Basic-REST-API). This should
> > allow any standard graph query to run over them.
> >
> > More specifically, direct APIs are available in the form of entity
> > lineage (
> > http://falcon.apache.org/0.6.1/restapi/EntityDependencies.html &
> > http://falcon.apache.org/0.6.1/restapi/EntityLineage.html) and instance
> > lineage (in trunk, pending release) for direct consumption without having
> > to write custom queries.
> >
> > Regards
> > Srikanth Sundarrajan
> >
> > > From: anuj.o.kumar@accenture.com
> > > To: dev@falcon.apache.org
> > > Subject: Need help in understanding Lineage With Falcon
> > > Date: Tue, 14 Jul 2015 22:51:46 +0000
> > >
> > > Hi,
> > > I am working with a client that uses Informatica Metadata Manager to
> > > visualise Lineage Information. Informatica Metadata Manager is currently
> > > used at the Data Warehouse layer and has proven effective.
> > > But unfortunately Informatica Metadata Manager does not have any
> > > connectors to Hadoop to collect metadata information, which makes it a
> > > less desirable tool for the entire end-to-end chain. This is where
> > > Apache Falcon comes to the rescue.
> > >
> > > Looking at Falcon, I see that Falcon exposes a set of REST APIs that
> > > can be used to capture metadata information about process, feed, and
> > > cluster entities (assuming that the workflow is scheduled using Apache
> > > Falcon). So we are exploring options on how we can actually generate
> > > metadata at the Hadoop layer that can then be used to feed Informatica
> > > Metadata Manager, which will combine it with its own metadata from DWH
> > > and business reports to provide complete lineage information.
> > >
> > > I have three specific questions with regard to the above problem:
> > >
> > >
> > >   1.  Where is the metadata repository located for Apache Falcon? Is it
> > >   the config store on Hadoop or HCatalog?
> > >   2.  Is there a way to connect to this repository (e.g. via JDBC)?
> > >   3.  What set of REST APIs can be called from outside of the Falcon
> > >   environment to capture the metadata information about the processes
> > >   scheduled using Falcon? I looked at this
> > >   <http://falcon.apache.org/0.6.1/restapi/> set of REST APIs, which was
> > >   a start for me, but I got lost in the details.
> > >
> > > A quick answer would be really appreciated.
> > >
> > > Thanks,
> > > Anuj Kumar
> > > Technology Architect - Emerging Technology Innovation group
> > > mobile: +31 6 30458915
> > > ITO Toren - Gustav Mahlerplein 90 - 1082MA Amsterdam
> > >
> >
> >
>
>
>
> --
> *Anuj Kumar*
>

Re: Need help in understanding Lineage With Falcon

Posted by Seetharam Venkatesh <ve...@innerzeal.com>.
Anuj, Falcon computes data lineage and does not have column-level lineage.

The way to integrate with Informatica is to export the contents of the
graph DB and then import it into the tool using XConnect - but I think this
integration will not work OOTB and needs some love from the tool vendor.
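
A minimal sketch of that export step, assuming the Rexster-style lineage
endpoints mentioned earlier in the thread are available; the host, port,
user, and output file names are placeholders. Whatever XConnect the vendor
builds would then map these JSON dumps into Metadata Manager's model:

    import json
    import requests  # third-party HTTP client, assumed to be installed

    FALCON = "http://falcon-host:15000"
    PARAMS = {"user.name": "falcon"}

    # Dump every vertex and every edge of Falcon's lineage graph to local JSON
    # files that a downstream import (e.g. a custom XConnect) could consume.
    for kind in ("vertices", "edges"):
        resp = requests.get(FALCON + "/api/metadata/lineage/%s/all" % kind,
                            params=PARAMS)
        resp.raise_for_status()
        with open("falcon-lineage-%s.json" % kind, "w") as out:
            json.dump(resp.json(), out, indent=2)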

