Posted to common-user@hadoop.apache.org by Andy Sautins <an...@returnpath.net> on 2009/08/18 22:47:14 UTC

Looking for advice on file structure...

   All,

   I'm looking for a little bit of advice on how to format files.

   The problem is that I have log files from a number of different sources.  The data elements between log files overlap by about 80%, but each log file also has unique data items that I want to keep and be able to access from my Map/Reduce jobs.  There also isn't a single obvious key for the log file entries.  As a quick example, take two different log files: log file 1 has three columns of data types A, B, and C and is tab delimited; log file 2 has data types A, B, C, and D and is pipe delimited.  I'd like to pre-process them into files so that, in the map/reduce job, I can consistently access data element A across both types of log files and also access element D when it exists.

    I suspect the best answer is to pre-process the files into a common file format that allows for variable data values within a log line (a rough sketch of what I mean is below).  What I'm wondering is: has anyone else solved this type of problem, and did you find a solution you liked?
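    Something like the following is the shape I'm picturing for the normalization step (untested; the column names, ordering, and delimiters here are just placeholders for my real fields):

import java.util.HashMap;
import java.util.Map;

public class LogNormalizer {

    // Columns for the tab-delimited source: A, B, C
    private static final String[] SOURCE1_COLS = {"A", "B", "C"};
    // Columns for the pipe-delimited source: A, B, C, D
    private static final String[] SOURCE2_COLS = {"A", "B", "C", "D"};

    /** Parse one log line into a column-name -> value map, so downstream
     *  code can ask for "A" regardless of which source the line came from,
     *  and "D" is simply absent for lines from the first source. */
    public static Map<String, String> parse(String line, boolean pipeDelimited) {
        String[] cols = pipeDelimited ? SOURCE2_COLS : SOURCE1_COLS;
        String[] fields = line.split(pipeDelimited ? "\\|" : "\t");
        Map<String, String> record = new HashMap<String, String>();
        for (int i = 0; i < cols.length && i < fields.length; i++) {
            record.put(cols[i], fields[i]);
        }
        return record;
    }
}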

   Where I've been looking so far is SequenceFiles.  Since there isn't a logical key, my thought was to just use a line number as the SequenceFile key, similar to the positional key the default input format generates, although that feels a little weird.  For the value, since I want somewhat arbitrary key/value pairs per record, my thought was to store something like a serialized HashMap.
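    For concreteness, here's roughly what I'm picturing for writing the pre-processed output (untested; I've reached for Hadoop's MapWritable rather than hand-serializing a java.util.HashMap, since it already implements Writable):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class NormalizedLogWriter {

    /** Write normalized records to a SequenceFile, keyed by line number
     *  since the data has no natural key. */
    public static void write(Path out, Iterable<Map<String, String>> records)
            throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, LongWritable.class, MapWritable.class);
        try {
            long lineNo = 0;
            for (Map<String, String> record : records) {
                // Copy each column-name -> value pair into a MapWritable.
                MapWritable value = new MapWritable();
                for (Map.Entry<String, String> e : record.entrySet()) {
                    value.put(new Text(e.getKey()), new Text(e.getValue()));
                }
                writer.append(new LongWritable(lineNo++), value);
            }
        } finally {
            writer.close();
        }
    }
}

    In the mapper, each value would then come back as a MapWritable, and I could check for the presence of the "D" key to handle records from the second log type.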

    Any thoughts on whether I'm re-inventing the wheel here or heading off in a strange direction?

    Thanks

    Andy