Posted to dev@eagle.apache.org by Don Bosco Durai <bo...@apache.org> on 2016/07/15 21:00:13 UTC

Re: Apache Ranger integration for Audit Logs...

I have some spare time and was planning to work on this. If no one is currently looking into this JIRA, can you assign it to me?

https://issues.apache.org/jira/browse/EAGLE-59


Thanks

Bosco


On 11/29/15, 8:43 PM, "Don Bosco Durai" <bo...@apache.org> wrote:

    Edward
    
    Thanks. I will look into the HdfsAuditLogProcessorMain class.
    
    I will upload the sample files in the JIRA. 
    
    
    
    Thanks
    
    Bosco
    
    
    On 11/29/15, 7:56 PM, "Zhang, Edward (GDI Hadoop)" <yo...@ebay.com> wrote:
    
    >One more thing, Bosco: could you please copy some sample HDFS audit,
    >HBase, and Hive logs here?
    >
    >I realize that with the Ranger data source we probably still need some
    >minor code development, as follows:
    >1. Substitute the existing Eagle data source (raw HDFS audit log) with the
    >Ranger data source; for example, in HdfsAuditLogProcessorMain, modify the
    >code to use a different log deserializer.
    >2. Ensure the output of the Ranger log deserializer is compatible with the
    >existing Eagle data source.
    >
    >With the above code changes, we automatically get all capabilities such as
    >sensitivity-data joins, user Hadoop command reassembly, Hive query
    >semantics parsing, etc.
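The two steps above can be pictured with a minimal deserializer sketch. This is illustrative Python only, not Eagle's actual deserializer API; the Ranger field names (`reqUser`, `evtTime`, `access`, `resource`, `result`, `cliIP`) and the target field names are assumptions based on typical Ranger JSON audit events and HDFS audit log fields.

```python
import json

# Hypothetical deserializer for Ranger JSON audit events. Its output uses the
# same field names a raw-HDFS-audit-log deserializer would emit, so the
# downstream Eagle pipeline (sensitivity joins, command reassembly, etc.)
# would not need to change. All field names here are illustrative assumptions.
def deserialize_ranger_audit(line):
    event = json.loads(line)
    return {
        "user": event.get("reqUser"),         # analogous to 'ugi' in HDFS audit logs
        "cmd": event.get("access"),           # e.g. 'open', 'mkdirs'
        "src": event.get("resource"),         # audited path or table
        "host": event.get("cliIP"),
        "allowed": event.get("result") == 1,  # assume 1 = access granted
        "timestamp": event.get("evtTime"),
    }

sample = json.dumps({
    "reqUser": "hive", "access": "open", "resource": "/warehouse/t1",
    "cliIP": "10.0.0.1", "result": 1, "evtTime": "2015-11-29 18:52:01.035",
})
parsed = deserialize_ranger_audit(sample)
```

If a second deserializer keeps the same output schema, step 2 is satisfied, and which deserializer HdfsAuditLogProcessorMain wires in could become a configuration choice.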
    >
    >Thanks
    >Edward Zhang
    >
    >On 11/29/15, 18:52, "Zhang, Edward (GDI Hadoop)" <yo...@ebay.com> wrote:
    >
    >>Hi Bosco,
    >>
    >>Thanks for creating this ticket. It would be very helpful if Eagle could
    >>use Ranger as a data source and automatically get monitoring capability
    >>for 9 Hadoop components.
    >>
    >>If a data source is not from Kafka and needs a lot of pre-processing, it
    >>is not trivial to integrate that data source.
    >>
    >>Ranger's data source should be uniform in syntax, and the integration
    >>should be straightforward if we have a uniform deserializer.
    >>
    >>I think we can document the steps for integrating a new data source.
    >>
    >>Thanks
    >>Edward Zhang
    >>
    >>On 11/29/15, 12:00, "Don Bosco Durai" <bo...@apache.org> wrote:
    >>
    >>>Hi Eagle team
    >>>
    >>>I am excited to see all the activities on this project. I have created a
    >>>JIRA (https://issues.apache.org/jira/browse/EAGLE-59) to track the
    >>>integration with Apache Ranger.
    >>>
    >>>One way to integrate is for Ranger to send the audit logs to Kafka in
    >>>the same format as the native logs. However, Ranger already normalizes
    >>>the audit format for all the components, so reconstructing the native
    >>>format might not be a good way to go.
    >>>
    >>>I am still getting familiar with the internals of Apache Eagle, but it
    >>>would be great if someone could help me, or document how a third-party
    >>>source can be integrated with Apache Eagle. Also, what changes are
    >>>required on the analytics side to support new data sources? E.g., if we
    >>>integrate with Ranger audit logs, we would get audit logs from around 9
    >>>components right away. How can we use them?
    >>>
    >>>If that is okay with you, I am willing to work on this JIRA.
    >>>
    >>>Thanks
    >>>
    >>>Bosco
    >>> 
    >>>
    >>
    >
    
    



Re: Apache Ranger integration for Audit Logs...

Posted by Don Bosco Durai <bo...@apache.org>.
Thanks, I will work on the “develop” branch.

I have a couple of possible designs in mind:

1. Enhance Ranger to write the audit logs to a file in JSON format and use the LogFeeder from Apache Ambari to send them to a Kafka topic, from which a Storm topology can transform them into something Eagle can understand.
2. Use the Kafka log4j appender to write Ranger audits directly to a Kafka topic. From there, use a Storm topology to transform the events into the Eagle format.
3. Write a first-class Kafka destination for Ranger audits.

With option #1, we will have a dependency on Ambari, but none of the Hadoop components needs to be updated with Kafka client library dependencies.

With option #2, we don’t need another process to monitor and publish logs to Kafka, but we will have to copy the Kafka client libraries to each Hadoop component and rely on the Kafka log4j appender to handle all error handling, including unavailability of the Kafka brokers.

With option #3, we will need to do some work on the Ranger side to implement the destination, and also copy the Kafka client library files to each Hadoop component. However, a first-class Kafka destination will give a lot of flexibility around batch processing, availability (implicit store-and-forward), Kerberos, and more.
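For option #2, the wiring could look roughly like the log4j fragment below. This is a sketch, not a tested configuration: it assumes Kafka's log4j-appender artifact is on each component's classpath, the appender class and property names should be checked against the Kafka version in use, and the `xaaudit` logger name is a placeholder for whatever audit logger the Ranger plugin actually uses.

```properties
# Hypothetical log4j configuration routing Ranger audit logs to Kafka.
# Appender class and property names depend on the Kafka version in use.
log4j.appender.KAFKA_AUDIT=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.KAFKA_AUDIT.brokerList=kafka-broker-1:9092,kafka-broker-2:9092
log4j.appender.KAFKA_AUDIT.topic=ranger_audits
log4j.appender.KAFKA_AUDIT.syncSend=false
log4j.appender.KAFKA_AUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA_AUDIT.layout.ConversionPattern=%m%n

# Attach to the (placeholder) audit logger used by the Ranger plugin:
log4j.logger.xaaudit=INFO, KAFKA_AUDIT
```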

Regardless, the Storm topology on the Eagle side that parses Ranger logs will be the same.

If anyone has a preference among the three options, or wants to suggest something new, please let me know.

Thanks

Bosco



On 7/15/16, 3:04 PM, "Edward Zhang" <yo...@gmail.com> wrote:

    Thanks Bosco.
    
    For new features, please work under the develop branch, where Eagle 0.5 is
    targeted.
    
    In the develop branch, we have a different programming paradigm than
    before. The alert engine is a separate, general Storm topology, and the
    applications that prepare data are separate Storm topologies, so that the
    output of an application becomes the input to the alert engine.
    
    So you can write an application based on whatever framework you like:
    Storm, Spark, etc. But in Eagle 0.5 we will only support Storm-based
    applications, where Eagle provides a framework to manage the application
    lifecycle.
    
    At the beginning, you probably just need to write a plain Storm topology
    to process data from Apache Ranger.
    
    Thanks
    Edward
    





Re: Apache Ranger integration for Audit Logs...

Posted by Edward Zhang <yo...@gmail.com>.
Thanks Bosco.

For new features, please work under the develop branch, where Eagle 0.5 is
targeted.

In the develop branch, we have a different programming paradigm than before.
The alert engine is a separate, general Storm topology, and the applications
that prepare data are separate Storm topologies, so that the output of an
application becomes the input to the alert engine.

So you can write an application based on whatever framework you like: Storm,
Spark, etc. But in Eagle 0.5 we will only support Storm-based applications,
where Eagle provides a framework to manage the application lifecycle.

At the beginning, you probably just need to write a plain Storm topology to
process data from Apache Ranger.

Thanks
Edward
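The separation described above can be pictured with a toy sketch: one stage (the "application") normalizes raw Ranger-style events, and a second, independent stage (the "alert engine") evaluates policies over whatever the first stage emits. This is purely illustrative Python, not Eagle's API; every name, field, and the sample policy is invented.

```python
# Toy illustration of the Eagle 0.5 split: the application prepares and
# normalizes events, and the alert engine consumes them as a separate stage.
# All names, fields, and the sample policy are invented for illustration.

def prepare_events(raw_events):
    """Application stage: normalize raw audit events into a common shape."""
    return [
        {"user": e["reqUser"], "path": e["resource"], "denied": e["result"] == 0}
        for e in raw_events
    ]

def alert_engine(events, policy):
    """Alert-engine stage: evaluate a policy over prepared events."""
    return [e for e in events if policy(e)]

raw = [
    {"reqUser": "hive", "resource": "/warehouse/t1", "result": 1},
    {"reqUser": "joe", "resource": "/secure/payroll", "result": 0},
]
# Example policy: flag denied accesses under /secure
alerts = alert_engine(
    prepare_events(raw),
    policy=lambda e: e["denied"] and e["path"].startswith("/secure"),
)
```

The point is only that the two stages communicate through a stream of plain events, so a Ranger-facing application could be developed and deployed without touching the alert engine.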
