Posted to mapreduce-issues@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2011/07/12 13:49:00 UTC

[jira] [Updated] (MAPREDUCE-778) [Rumen] Need a standalone JobHistory log anonymizer

     [ https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated MAPREDUCE-778:
---------------------------------

    Attachment: mapreduce-778-v1.2-2.patch

Attaching a patch that performs very basic anonymization of Rumen job traces and cluster topologies. The patch adds a new Rumen option called 'Anonymizer', which anonymizes the specified Rumen job trace and/or cluster topology. The approach relies on Jackson's support for custom object serializers.

TODO:
1. Remove code duplication in Folder and HadoopLogsAnalyzer (i.e., use DefaultOutputter).
2. Extract and anonymize important job configuration parameters from job properties. The current patch (v1.2-2) simply filters out (hides) all the job configuration parameters. The biggest challenge here is handling class names (mapper/reducer/combiner etc.) and paths (job input/output etc.). Class names that are open source or publicly available (e.g. org.apache.*) can be kept as-is, without anonymization.
3. Strings like job names can be masked intelligently so that they retain certain key characteristics. For example, keywords such as 'monthly', 'daily', 'weekly' and 'hourly' in a job name can be retained because they describe the job's characteristics.
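
To make TODO items 2 and 3 concrete, here is a rough sketch of the two masking rules. It is not taken from the patch: the class AnonymizerSketch, its method names, the "class"/"job" token format and the keyword list are all hypothetical, and it is not wired into Rumen or Jackson.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only; none of these names come from the attached patch.
final class AnonymizerSketch {
    // Assumption: class names under these prefixes are public and kept verbatim.
    private static final String[] PUBLIC_PREFIXES = {"org.apache."};
    // Assumption: keywords that describe a job's periodicity and are safe to retain.
    private static final String[] KEEP_KEYWORDS = {"monthly", "weekly", "daily", "hourly"};

    private final Map<String, String> classTokens = new HashMap<>();
    private final Map<String, String> jobTokens = new HashMap<>();

    // Keeps publicly available class names, replaces all others with a stable token.
    String anonymizeClassName(String name) {
        for (String prefix : PUBLIC_PREFIXES) {
            if (name.startsWith(prefix)) {
                return name;
            }
        }
        return stableToken(classTokens, name, "class");
    }

    // Replaces the job name with a stable token, appending any retained keywords.
    String anonymizeJobName(String name) {
        StringBuilder masked = new StringBuilder(stableToken(jobTokens, name, "job"));
        String lower = name.toLowerCase(Locale.ROOT);
        for (String keyword : KEEP_KEYWORDS) {
            if (lower.contains(keyword)) {
                masked.append('-').append(keyword);
            }
        }
        return masked.toString();
    }

    // Returns the same token for the same input so relationships in the trace survive.
    private static String stableToken(Map<String, String> seen, String key, String prefix) {
        String token = seen.get(key);
        if (token == null) {
            token = prefix + seen.size();
            seen.put(key, token);
        }
        return token;
    }

    public static void main(String[] args) {
        AnonymizerSketch a = new AnonymizerSketch();
        System.out.println(a.anonymizeClassName("org.apache.hadoop.mapred.lib.IdentityMapper"));
        System.out.println(a.anonymizeClassName("com.example.SecretMapper"));
        System.out.println(a.anonymizeJobName("Daily revenue rollup"));
    }
}
```

Mapping each distinct input to the same token keeps repeated class names and job names recognizable as repeats within a trace, which is presumably what downstream tools would need. A real implementation would also have to handle paths, which this sketch ignores.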

> [Rumen] Need a standalone JobHistory log anonymizer
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: tools/rumen
>            Reporter: Hong Tang
>            Assignee: Amar Kamat
>              Labels: anonymization, rumen
>         Attachments: anonymizer.patch, anonymizer.py, mapreduce-778-v1.2-2.patch, same.py
>
>
> Job history logs contain a rich set of information that can help understand and characterize cluster workload and individual job execution. Examples of work that parses or utilizes job history include HADOOP-3585, MAPREDUCE-534, HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some of the parsing tools developed in previous work already contain a component to anonymize the logs. It would be nice to combine these efforts and have a common standalone tool that can anonymize job history logs and preserve much of the structure of the files, so that existing tools built on top of job history logs continue to work with no modification.