Posted to mapreduce-issues@hadoop.apache.org by "Amar Kamat (Updated) (JIRA)" <ji...@apache.org> on 2011/12/12 19:37:31 UTC

[jira] [Updated] (MAPREDUCE-778) [Rumen] Need a standalone JobHistory log anonymizer

     [ https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated MAPREDUCE-778:
---------------------------------

    Attachment: mapreduce-778-v1.14-12.patch

Attaching a patch that adds the remaining features to the Anonymizer. 

Some newly added features:
1. Gridmix runs with the anonymized trace.

2. Class names now have a filter that lets the user specify which packages to pass through.

3. Job config parsing and filtering is done. Only MR (framework-level) configs are parsed and allowed. There is a config to extend this functionality and allow users to handle other keys (e.g. Pig).
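
To illustrate items 2 and 3, here is a minimal Python sketch of a package pass-through filter for class names and a prefix-based filter for job config keys. It is not the code in the patch: the function names, the pass-through package list, the allowed key prefixes, and the hash-based placeholder are all assumptions made for illustration. The real parameter names are in the Rumen manual.

```python
import hashlib

# Hypothetical settings, chosen only for this sketch.
PASS_THROUGH_PACKAGES = ("org.apache.hadoop.",)
ALLOWED_KEY_PREFIXES = ("mapred.", "mapreduce.")


def anonymize_token(token, prefix):
    """Map a token to a stable placeholder derived from a hash.

    Illustrative only; a truncated hash is not a strong anonymization scheme.
    """
    digest = hashlib.sha1(token.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}{digest}"


def anonymize_classname(name, passthrough=PASS_THROUGH_PACKAGES):
    """Keep class names from pass-through packages; replace all others."""
    if name.startswith(passthrough):  # str.startswith accepts a tuple
        return name
    return anonymize_token(name, "class-")


def filter_job_config(conf, allowed=ALLOWED_KEY_PREFIXES):
    """Keep only config entries whose key starts with an allowed prefix."""
    return {k: v for k, v in conf.items() if k.startswith(allowed)}
```

In this sketch, extending the filter to other keys (for example Pig properties) would amount to passing a longer `allowed` tuple.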

Read the Rumen manual for details on the Anonymizer and its configuration parameters. 

Testing:
test-patch and ant tests passed. Also tested on 2 days' worth of job history data.

Todos:
1. Currently, job property parsing considers only MR (framework-level) properties. Add job config parsers for Input/Output file formats, Pig, etc.
2. Chunking of data (esp. job names) to preserve some useful stats, e.g. daily/weekly/monthly.
                
> [Rumen] Need a standalone JobHistory log anonymizer
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: tools/rumen
>            Reporter: Hong Tang
>            Assignee: Amar Kamat
>              Labels: anonymization, rumen
>         Attachments: anonymizer.patch, anonymizer.py, mapreduce-778-v1.14-12.patch, mapreduce-778-v1.2-2.patch, same.py
>
>
> Job history logs contain a rich set of information that can help understand and characterize cluster workload and individual job execution. Examples of work that parses or utilizes job history include HADOOP-3585, MAPREDUCE-534, HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some of the parsing tools developed in previous work already contain a component to anonymize the logs. It would be nice to combine these efforts and have a common standalone tool that can anonymize job history logs and preserve much of the structure of the files, so that existing tools built on top of job history logs continue to work with no modification.
