Posted to dev@ranger.apache.org by "Ramesh Mani (JIRA)" <ji...@apache.org> on 2018/01/29 18:52:00 UTC

[jira] [Comment Edited] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format

    [ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341900#comment-16341900 ] 

Ramesh Mani edited comment on RANGER-1837 at 1/29/18 6:51 PM:
--------------------------------------------------------------

[~bosco] [~madhan.neethiraj] [~risdenk]
 I have attached the next revision of the patch for your review.
 1) Addresses ORC file creation in the HDFS destination.

2) Bulk copy of JSON files from the local store to the HDFS destination.

3) Audit to Solr will use the existing BatchQueue instead of AuditFileQueue, so it follows the existing flow rate (i.e., the Solr and HDFS destinations will have different audit flow rates); see the config sketch below.

4) The AuditFileQueue flow rate will depend on the filequeue rollover time.
 There are changes to the parameters that enable the AuditFileQueue, which I have documented in this note. I have been testing this with HDFS as the service; I shall test other services as well.
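
As a concrete sketch of point 3, the relevant entries in ranger-<component>-audit.xml might look like the following (key=value form as in the notes below; the two destination enable flags are standard Ranger audit settings and are shown here only for context):
{code:java}
# Sketch: HDFS destination switched to the local file queue, Solr destination
# left on its existing (default) in-memory BatchQueue.
xasecure.audit.destination.hdfs=true
xasecure.audit.destination.hdfs.batch.queuetype=filequeue
xasecure.audit.destination.solr=true
# No queuetype override for Solr: it keeps the default memqueue/BatchQueue flow.
{code}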
{code:java}
ORC FILE FORMAT in HDFS Ranger Audit log with local audit file store as source for HDFS audit:
	NOTE: When this is enabled, each record in the local file is read to create the ORC file.

    1. Enable Ranger Audit to HDFS in ORC file format using AuditFileQueue
        - To enable Ranger Audit to HDFS with ORC format, we first need to enable AuditFileQueue to spool the audit to the local filesystem.
            * On the NameNode host, create the spool directory and make sure the owner of the service for which the Ranger plugin is enabled has read/write/execute permission on the path (e.g. for the HDFS service this is hdfs:hadoop, for the Hive service hive:hadoop, etc.)

                $ mkdir -p /var/log/hadoop/audit/staging/spool
                $ cd /var/log/hadoop/audit/staging
                $ chown hdfs:hadoop spool

            * Enable AuditFileQueue via the following params in ranger-<component>-audit.xml
               xasecure.audit.destination.hdfs.batch.queuetype=filequeue (NOTE: default = memqueue, where an in-memory queue/buffer is used instead of a local file buffer)
               xasecure.audit.destination.hdfs.batch.filequeue.filespool.file.rollover.sec=300 (this determines the batch size of the ORC file that is created)
               xasecure.audit.destination.hdfs.batch.filequeue.filespool.dir=/var/log/hadoop/audit/staging/spool (the local staging directory for audit)
               xasecure.audit.destination.hdfs.batch.filequeue.filespool.buffer.size=10000 (this, along with the rollover.sec parameter, determines the batch size for ORC file creation)

    2. Enable the ORC file format for Ranger HDFS audit.
          - This is done by setting the following param in ranger-<component>-audit.xml. By default the value is "json".

            xasecure.audit.destination.hdfs.filetype=orc (default = json)

    3. Provision to control the compression technique for the ORC format. Default is 'snappy'.
            xasecure.audit.destination.hdfs.orc.compression=snappy|lzo|zlib|none

    4. Buffer size and stripe size of the ORC file batch. Defaults are '10000' bytes and '100000' bytes respectively. These decide the batch size of the ORC file in HDFS.
            xasecure.audit.destination.hdfs.orc.buffersize= (value in bytes)
            xasecure.audit.destination.hdfs.orc.stripesize= (value in bytes)

    5. Hive query to create the ORC table with the default 'snappy' compression.

        CREATE EXTERNAL TABLE ranger_audit_event (
        repositoryType int,
        repositoryName string,
        reqUser string,
        evtTime string,
        accessType string,
        resourcePath string,
        resourceType string,
        action  string,
        accessResult string,
        agentId string,
        policyId  bigint,
        resultReason string,
        aclEnforcer string,
        sessionId string,
        clientType string,
        clientIP string,
        requestData string,
        clusterName string
        )
        STORED AS ORC
        LOCATION '/ranger/audit/hdfs'
        TBLPROPERTIES  ("orc.compress"="SNAPPY");

{code}
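
Once the external table from step 5 exists, the ORC audit data can be queried directly from Hive. A quick illustrative sketch (the accessResult encoding used in the filter is an assumption; verify it against actual audit rows):
{code:sql}
-- Sketch: most frequently denied users per access type, from the
-- ranger_audit_event table defined in step 5 above.
SELECT reqUser, accessType, count(*) AS denials
FROM ranger_audit_event
WHERE accessResult = '0'   -- assumption: '0' marks a denied request
GROUP BY reqUser, accessType
ORDER BY denials DESC
LIMIT 10;
{code}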
{code:java}
JSON FILE FORMAT in HDFS Ranger Audit log with local audit file store as source for HDFS audit:
	NOTE: When this is enabled, each local file is copied in its entirety to the HDFS destination. This lets us generate larger Ranger audit files in HDFS, which is preferred.

	 1. Enable Ranger Audit to HDFS in JSON file format using AuditFileQueue
        - To enable Ranger Audit to HDFS with JSON format and a local file cache, we first need to enable AuditFileQueue to spool the audit locally.

            * On the NameNode host, create the spool directory and make sure the owner of the service for which the Ranger plugin is enabled has read/write/execute permission on the path (e.g. for the HDFS service this is hdfs:hadoop, for the Hive service hive:hadoop, etc.)

                $ mkdir -p /var/log/hadoop/audit/staging/spool
                $ cd /var/log/hadoop/audit/staging
                $ chown hdfs:hadoop spool

            * Enable AuditFileQueue via the following params in ranger-<component>-audit.xml
               xasecure.audit.destination.hdfs.batch.queuetype=filequeue (NOTE: default = memqueue, where an in-memory queue/buffer is used instead of a local file buffer)
               xasecure.audit.destination.hdfs.batch.filequeue.filespool.file.rollover.sec=300 (this determines the size of the JSON file that will be copied to HDFS)
               xasecure.audit.destination.hdfs.batch.filequeue.filespool.dir=/var/log/hadoop/audit/staging/spool (the local staging directory for audit)
{code}
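
Since the JSON files land in HDFS unconverted, they can still be loaded into the ORC table afterwards, along the lines Bosco describes in the issue below. A minimal sketch, assuming the hive-hcatalog JsonSerDe is on the classpath and using a hypothetical /ranger/audit/hdfs-json landing directory:
{code:sql}
-- Sketch: expose the JSON audit files as an external table, then load them
-- into the ORC table from step 5. The location path is a placeholder, and the
-- column names must match the JSON key names in the actual audit files;
-- check a sample file and rename/map columns if the keys are abbreviated.
CREATE EXTERNAL TABLE ranger_audit_event_json (
    repositoryType int, repositoryName string, reqUser string, evtTime string,
    accessType string, resourcePath string, resourceType string, action string,
    accessResult string, agentId string, policyId bigint, resultReason string,
    aclEnforcer string, sessionId string, clientType string, clientIP string,
    requestData string, clusterName string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/ranger/audit/hdfs-json';

INSERT INTO TABLE ranger_audit_event
SELECT * FROM ranger_audit_event_json;
{code}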


> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>            Priority: Major
>         Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-002.patch, 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed files in ORC were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are a couple of ways of doing it:
>  
> Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
> Write an AuditDestination which can write into the format you want.
>  
> Regardless of which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)