Posted to mapreduce-user@hadoop.apache.org by Friso van Vollenhoven <fv...@xebia.com> on 2011/10/28 12:12:32 UTC

Auditable Hadoop

Hi all,

I have an auditing challenge: I am looking for a fairly detailed audit trail on MR jobs. I know that HDFS has an audit log, which you can write to a separate file through log4j config. But what I ideally need is something that allows me to determine, with certainty, which jobs were run against what data and by whom. And by 'which jobs' I mean the source code, not just the binary.
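
For reference, the separate audit log file I mean is set up roughly like this in log4j.properties (the appender name and file path are just illustrative, but this is the usual pattern from the stock Hadoop config):

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,RFAAUDIT
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n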

One idea I had for plain Java based MR jobs is to have devs (me) check the job's code into some SCM, and then have a tool, running on a machine that has cluster access but is not open to the dev team, check out a particular branch, build it, and write the SCM tag plus a hash of the resulting binary (.jar) to an audit trail. Then I'd need some kind of hook in the job tracker that checks whether a submitted job's binary has been audited this way. That way you could always trace back to the source that was actually executed.
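
To make that concrete, the audit tool I have in mind would do little more than this after the build (class name, record format and audit trail location are all made up for the sake of the example):

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.security.MessageDigest;

public class AuditRecorder {

    // Compute a SHA-256 digest of the built job jar.
    static String sha256(String jarPath) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        InputStream in = new FileInputStream(jarPath);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        in.close();
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String scmTag = args[0];   // the branch/tag that was checked out
        String jarPath = args[1];  // the jar produced by the build
        // Append one record per audited build: SCM tag, jar digest, timestamp.
        FileWriter out = new FileWriter("/var/audit/job-binaries.log", true);
        out.write(scmTag + "\t" + sha256(jarPath) + "\t" + System.currentTimeMillis() + "\n");
        out.close();
    }
}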

The question is: is there a hook point like that in the JT, or do I need to patch it? It would be nice to have the auditing happen in the JT, so that the dev team can keep regular access to the cluster (i.e. use the hadoop command line tool to copy/move files, etc.) and the JT would simply reject jobs that have not been audited.
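
What I imagine (and this is purely hypothetical, I have not found such an extension point) is something shaped like this, with the JT calling accept() on every submission and refusing the job on false:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical submission filter; the audit trail format matches the
// records written by the build tool sketched above.
public class AuditedJarFilter {

    private final Set<String> approvedDigests = new HashSet<String>();

    // Load the jar digests written by the build/audit tool.
    public AuditedJarFilter(String auditTrailPath) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(auditTrailPath));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                approvedDigests.add(fields[1]);  // second column is the jar digest
            }
        }
        in.close();
    }

    // The JT would call something like this for each submitted job.
    public boolean accept(String user, String jarSha256) {
        return approvedDigests.contains(jarSha256);
    }
}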

Also, with this model, non-Java jobs are a problem. I probably won't be using streaming, but Pig, Hive and Mahout will likely be used. For those I'd need some additional steps: confirming that the Pig/Hive/Mahout binaries that get submitted are the trusted ones, and having Pig/Hive/Mahout add some config params or other info about the script/query being executed to the job plan.
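
For example (the property names are made up, and Pig/Hive/Mahout would have to be patched or wrapped to do the equivalent when they build their job plans), something like:

import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;

public class ScriptStamper {

    // Stamp the job configuration with the script/query text and its digest,
    // so the audit trail can tie the submitted jobs back to the source.
    public static void stamp(Configuration conf, String scriptText) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(scriptText.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        conf.set("audit.script.sha256", hex.toString());  // hypothetical key
        conf.set("audit.script.source", scriptText);      // or a pointer to it in SCM
    }
}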

Does anyone have any ideas on this? Or relevant experiences?


Thanks,
Friso