Posted to dev@atlas.apache.org by "Andrew Ahn (JIRA)" <ji...@apache.org> on 2015/10/08 19:25:26 UTC

[jira] [Updated] (ATLAS-184) Integrate Sqoop metadata into Atlas

     [ https://issues.apache.org/jira/browse/ATLAS-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ahn updated ATLAS-184:
-----------------------------
    Description: 
Apache Sqoop Integration with Apache Atlas (incubating)
Introduction
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

Apache Atlas is a metadata repository that enables end-to-end data lineage, search, and association of business classifications. 
Overview
The goal of this integration is, at a minimum, to push the Sqoop-generated query metadata, along with the source provenance, target(s), and any available business context, so Atlas can capture the lineage for this topology.

There are two parts to this process, detailed below:
1.	Data model to represent the concepts in Sqoop
2.	Sqoop Bridge/Hook to update metadata in Atlas
Data Model
A data model is represented as a Type in Atlas. These types can reuse, or be closely modeled after, the Hive data types that already exist. At a minimum, we need to create types for:
•	Sqoop processes containing the SQL query text, start/end times, user, etc. 
•	Source provenance, fine-grained at the DB, table, column, etc. level, so we have a 1-1 mapping between source and target assets
•	Target (typically Hive, HBase, HDFS, etc.)

You can take a look at the data model code for Hive; Sqoop should reuse it or be closely modeled after it.
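As a rough illustration of the attributes these new types might carry, here is a minimal sketch using plain Python dataclasses. The type and field names (SqoopDbStore, SqoopProcess, etc.) are hypothetical placeholders, not the actual Atlas type system API; they only mirror the three bullets above (process, source provenance, target).

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch only: names are illustrative, not Atlas API.

@dataclass
class SqoopDbStore:
    """Source provenance: the database/table/columns a Sqoop job reads from."""
    db_store_type: str                 # e.g. "mysql", "oracle"
    store_uri: str                     # JDBC URI of the source
    table: str
    columns: List[str] = field(default_factory=list)

@dataclass
class SqoopProcess:
    """The Sqoop execution itself, modeled like the Hive process type."""
    name: str
    operation: str                     # "import" or "export"
    command_line: str                  # the SQL query text Sqoop ran or generated
    start_time: int                    # epoch millis
    end_time: int
    user: str
    inputs: List[SqoopDbStore] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)  # qualified names of Hive/HBase/HDFS targets
```

Keeping inputs and outputs as first-class lists on the process is what gives Atlas the 1-1 source-to-target mapping the second bullet asks for.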
Pushing Metadata into Atlas
There are three parts to the bridge:
1.	Sqoop Bridge 
This does not apply to the current Sqoop tool, but will apply if and when we migrate to Sqoop 2.
2.	Post-execution Hook
Atlas needs to be notified when a new Sqoop ingest is executed successfully, or when someone changes the definition of an existing Sqoop job.
You can refer to the hook code for Hive.
3.	Column-level lineage
It would be good to have column-level lineage for data flowing from the source database/warehouse into Hive. 
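The post-execution hook in item 2 could look roughly like the sketch below: on successful job completion, build a process entity and hand it to a notification sender. The function names, the "sqoop_process" type name, and the message envelope are all assumptions for illustration; in practice the transport would likely be whatever notification mechanism the Hive hook already uses.

```python
import json
import time

def publish_to_atlas(entity: dict, notify) -> None:
    """Serialize the process entity and hand it to a notification sender.
    `notify` is injected so the hook stays testable; the real transport
    (e.g. a message queue, as in the Hive hook) is an assumption here."""
    message = json.dumps({
        "version": 1,
        "entity": entity,
        "timestamp": int(time.time() * 1000),
    })
    notify(message)

def on_sqoop_job_success(job_name, query, user, inputs, outputs, notify):
    """Hypothetical callback invoked after a successful Sqoop ingest
    (or when the definition of an existing Sqoop job changes)."""
    entity = {
        "typeName": "sqoop_process",   # hypothetical Atlas type name
        "name": job_name,
        "queryText": query,
        "user": user,
        "inputs": inputs,              # source DB tables
        "outputs": outputs,            # target Hive/HBase/HDFS assets
    }
    publish_to_atlas(entity, notify)
```

Injecting the sender keeps the hook decoupled from Atlas internals, mirroring how the Hive hook separates metadata capture from delivery.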
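Since a Sqoop import typically copies columns 1-1, the column-level lineage in item 3 could be derived by expanding the table-level edge into per-column edges. The sketch below assumes a simple source-column-to-Hive-column mapping; the edge shape and the "sqoop_column_lineage" label are hypothetical.

```python
def column_lineage(source_table, target_table, column_map):
    """Expand a table-level Sqoop import into per-column lineage edges.

    column_map maps source column -> target Hive column. The returned
    edge dictionaries are an illustrative shape, not an Atlas structure.
    """
    return [
        {
            "input": f"{source_table}.{src}",
            "output": f"{target_table}.{dst}",
            "process": "sqoop_column_lineage",  # hypothetical process label
        }
        for src, dst in column_map.items()
    ]
```

For a straight import the mapping is the identity on column names; a free-form query import would need the mapping extracted from the query's select list instead.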


> Integrate Sqoop metadata into Atlas
> -----------------------------------
>
>                 Key: ATLAS-184
>                 URL: https://issues.apache.org/jira/browse/ATLAS-184
>             Project: Atlas
>          Issue Type: Improvement
>    Affects Versions: 0.6-incubating
>            Reporter: Venkatesh Seetharam
>             Fix For: 0.6-incubating
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)