Posted to dev@ranger.apache.org by "Kevin Risden (JIRA)" <ji...@apache.org> on 2017/10/13 13:10:00 UTC

[jira] [Comment Edited] (RANGER-1837) HDFS Audit Compression

    [ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203539#comment-16203539 ] 

Kevin Risden edited comment on RANGER-1837 at 10/13/17 1:09 PM:
----------------------------------------------------------------

Also from Bosco on the mailing list:
{quote}
If we write as ORC or another file format directly, then we have to see how to batch the audits. In the Audit V3 implementation, we did some optimization to avoid store (local write) and forward, instead building the batch in memory and doing a bulk write (each Destination has different policies). But in the previous release, we re-introduced an option to store and forward to HDFS due to an HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on top of that code to write in the format we want, and make the output write configurable to support different types.
{quote}
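
To make the batching idea concrete, here is a minimal sketch of building a batch in memory and handing it to a destination in one bulk write. It is illustrative only and is not Ranger's Audit V3 code: the class name `BatchingAuditWriter`, the `sink` callable, and the default `batch_size` of 1000 are all assumptions made for the example.

```python
from typing import Callable, Dict, List

AuditEvent = Dict[str, object]


class BatchingAuditWriter:
    """Buffers audit events in memory and passes them to a sink in bulk.

    Hypothetical illustration; not Ranger's audit queue implementation.
    """

    def __init__(self, sink: Callable[[List[AuditEvent]], None],
                 batch_size: int = 1000) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._sink = sink              # performs the actual bulk write
        self._batch_size = batch_size  # assumed value; tune per destination
        self._buffer: List[AuditEvent] = []

    def log(self, event: AuditEvent) -> None:
        """Queue one event and flush once the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch; no-op when empty."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._sink(batch)
```

A real implementation would presumably also flush on a timer and on shutdown so that a partially filled batch is not held indefinitely; that is left out here to keep the sketch short.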

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having an Audit FileFormat as a feature in the Ranger Audit Framework.

In that case HDFSAuditDestination should have the provision to use a FileFormat before writing, whereas SolrDestination might not require this.

Each configured AuditDestination can have a format conversion before writing; we don’t need to have this format all the way from the audit generation point.
{quote}
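
One way to picture the per-destination format idea is a destination that takes a pluggable formatter and applies it only at write time. The sketch below is an assumption-laden illustration, not Ranger's API: `Destination`, `json_lines`, and `gzip_json_lines` are hypothetical names, and gzip-compressed JSON lines stand in for ORC so that the example needs only the standard library.

```python
import gzip
import json
from typing import Callable, Dict, Iterable

AuditEvent = Dict[str, object]
# A format turns a batch of events into the bytes a destination writes.
AuditFormat = Callable[[Iterable[AuditEvent]], bytes]


def json_lines(events: Iterable[AuditEvent]) -> bytes:
    """One JSON object per line, uncompressed."""
    return "".join(json.dumps(e, sort_keys=True) + "\n" for e in events).encode("utf-8")


def gzip_json_lines(events: Iterable[AuditEvent]) -> bytes:
    """The same JSON lines, gzip-compressed (a stand-in for ORC here)."""
    return gzip.compress(json_lines(events))


class Destination:
    """Hypothetical destination that converts events only when writing."""

    def __init__(self, name: str, write: Callable[[bytes], None],
                 fmt: AuditFormat = json_lines) -> None:
        self.name = name
        self._write = write
        self._fmt = fmt

    def send(self, events: Iterable[AuditEvent]) -> None:
        # The events stay in their generic form until this point.
        self._write(self._fmt(events))
```

With this shape, an HDFS-like destination could be constructed with `fmt=gzip_json_lines` while a Solr-like destination keeps the default, so the events themselves never carry a format from the generation point.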


> HDFS Audit Compression
> ----------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive in compressed ORC format. The compressed ORC files were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are a couple of ways of doing it:
>  
> * Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
> * Write an AuditDestination which can write into the format you want.
>  
> Regardless of which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E
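
The nightly-job option from the issue description can be approximated with a small script that compresses one day's folder of audit files after the day has closed. This is only a sketch under several assumptions: it works on a local directory rather than HDFS, it uses gzip instead of ORC, and the `*.log` file pattern and the function name `compress_day_folder` are hypothetical.

```python
import gzip
import shutil
from pathlib import Path
from typing import Dict, Tuple


def compress_day_folder(day_dir: Path) -> Dict[str, Tuple[int, int]]:
    """Gzip every *.log file in day_dir and delete the uncompressed original.

    Returns {file name: (original size, compressed size)} in bytes.
    """
    sizes: Dict[str, Tuple[int, int]] = {}
    for src in sorted(day_dir.glob("*.log")):
        dst = src.with_name(src.name + ".gz")
        # Stream the copy so a large audit file is never loaded into memory.
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        sizes[src.name] = (src.stat().st_size, dst.stat().st_size)
        src.unlink()
    return sizes
```

The returned sizes make it easy to check how much a given day's audits actually shrink before committing to a format.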


