Posted to dev@atlas.apache.org by "Rémy SAISSY (JIRA)" <ji...@apache.org> on 2015/09/17 12:01:46 UTC

[jira] [Comment Edited] (ATLAS-164) DFS addon for Atlas

    [ https://issues.apache.org/jira/browse/ATLAS-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802696#comment-14802696 ] 

Rémy SAISSY edited comment on ATLAS-164 at 9/17/15 10:01 AM:
-------------------------------------------------------------

Hi Venkatesh,
thanks :).

* DfsDataModel 
I agree; at first I considered three classes: file, dir and symlink.
I reverted back to a 1:1 mapping because handling symlinks required two different properties depending on whether the target is a file or a directory. I thought it would not be an issue to map inodes, since the query language makes it possible to show files, dirs and symlinks separately.

A question: can we model class inheritance? If so, I could have the dir, file and symlink classes inherit from inode and provide a clean symlink_target attribute typed with the parent class.
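For illustration, here is roughly how I picture those type definitions if inheritance is available, following the style of the hive bridge's data model generator (I am assuming the TypesUtil helpers it uses). The hdfs_inode / hdfs_symlink names and attributes are placeholders from my draft, not something already in the patch:

    import com.google.common.collect.ImmutableList;
    import org.apache.atlas.typesystem.types.ClassType;
    import org.apache.atlas.typesystem.types.DataTypes;
    import org.apache.atlas.typesystem.types.HierarchicalTypeDefinition;
    import org.apache.atlas.typesystem.types.utils.TypesUtil;

    // Inside the model generator's createDataModel():
    // Parent class carrying the attributes common to files, dirs and symlinks.
    HierarchicalTypeDefinition<ClassType> inode = TypesUtil.createClassTypeDef(
        "hdfs_inode", null,
        TypesUtil.createRequiredAttrDef("path", DataTypes.STRING_TYPE),
        TypesUtil.createOptionalAttrDef("owner", DataTypes.STRING_TYPE));

    // Subclasses inherit from hdfs_inode, so the symlink target can be typed
    // with the parent class instead of needing one property per target kind.
    HierarchicalTypeDefinition<ClassType> symlink = TypesUtil.createClassTypeDef(
        "hdfs_symlink", ImmutableList.of("hdfs_inode"),
        TypesUtil.createRequiredAttrDef("symlink_target", "hdfs_inode"));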

* Import

Thanks for the pointer, I will check how Falcon does it. Apart from the technical standpoint, I will also read up a bit on regulatory needs, since implementing this as data sets reduces the granularity and thus might not be precise enough for some regulatory needs.
Also, I see two approaches to data sets:
 - one that requires manually defining data sets in the webapp, so the bridge only logs those data sets (and ignores the other events on HDFS)
 - one that considers a data set to be a non-recursive directory; any action on a file logs an event for its directory

The latter has the advantage of processing all actions in HDFS and of being easier to configure and use for the end user, so I would prefer it.
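To make that second approach concrete, here is a minimal sketch of the mapping I have in mind (the helper name is just illustrative):

    import org.apache.hadoop.fs.Path;

    // With "data set == non-recursive directory", every file-level event is
    // attributed to its immediate parent directory, which becomes the Atlas entity.
    public static String dataSetFor(String affectedPath) {
        Path parent = new Path(affectedPath).getParent();
        // The root directory has no parent; treat it as its own data set.
        return parent == null ? "/" : parent.toUri().getPath();
    }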

* Lineage

This is because I haven't yet fully understood how lineage should be handled by Atlas addons.
 - should I also keep track of who executed what action on a data set / file / dir / symlink? I haven't seen support for it in the hive-bridge but I guess it is required to comply with regulatory needs.

Speaking about the set of files consumed by a Pig, MR, Spark or whatever job: since HDFS sees actions as they happen, I see two approaches:
 - HDFS level: consider a data set to be a non-recursive directory. That would generate a lot of events, but all for the same node in Atlas (the source / target directory of the job)
 - processing framework level: hook an addon into each framework that logs events into Atlas on the same data as the HDFS bridge does.

--> I prefer doing it at the HDFS level only. It is more generic.
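As a rough sketch of what that looks like on the Atlas side, the file-level events all collapse onto one entity, the parent directory. Type and attribute names below are placeholders from my draft model, and eventPath / clusterName stand in for whatever the bridge receives:

    import org.apache.atlas.typesystem.Referenceable;

    // Many file-level events, one Atlas node: the non-recursive parent directory.
    // dataSetFor() is the illustrative helper from the Import section above.
    Referenceable dirRef = new Referenceable("hdfs_dir");
    dirRef.set("path", dataSetFor(eventPath));
    dirRef.set("clusterName", clusterName);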

* Unit Tests

I've made a typo, I meant the integration test.




> DFS addon for Atlas
> -------------------
>
>                 Key: ATLAS-164
>                 URL: https://issues.apache.org/jira/browse/ATLAS-164
>             Project: Atlas
>          Issue Type: New Feature
>    Affects Versions: 0.6-incubating
>            Reporter: Rémy SAISSY
>            Assignee: Rémy SAISSY
>         Attachments: ATLAS-164.15092015.patch, ATLAS-164.15092015.patch
>
>
> Hi,
> I have written an addon for sending DFS metadata into Atlas.
> The patch is attached.
> However, I have a hard time getting the unit tests working properly, so some advice would be welcome.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)