Posted to common-user@hadoop.apache.org by Andy Sautins <an...@returnpath.net> on 2009/08/18 22:47:14 UTC

Looking for advice on file structure...

   All,

   I'm looking for a little bit of advice on how to format files.

   The problem is that I have log files from a number of different sources.  The data elements between log files overlap by about 80%, but each log file also has unique data items that I want to keep and be able to access from my Map/Reduce jobs.  There also isn't a single obvious key for the log file entries.  As a quick example, take two different log files: log file 1 has three columns of data types A, B, and C and is tab delimited; log file 2 has data types A, B, C, and D and is pipe delimited.  I'd like to pre-process them into files so that, in the map/reduce job, I can consistently access data element A across both types of log files and also access element D when it exists.

    I suspect the best answer is to pre-process the files into a common file format that allows for variable data values within a log line (a rough sketch of what I mean is below).  What I'm wondering is: has anyone else solved this type of problem, and did you find a solution you liked?
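    Something like the following is the shape I'm picturing for the normalization step (untested; the column names, ordering, and delimiters here are just placeholders for my real fields):

import java.util.HashMap;
import java.util.Map;

public class LogNormalizer {

    // Columns for the tab-delimited source: A, B, C
    private static final String[] SOURCE1_COLS = {"A", "B", "C"};
    // Columns for the pipe-delimited source: A, B, C, D
    private static final String[] SOURCE2_COLS = {"A", "B", "C", "D"};

    /** Parse one log line into a column-name -> value map, so downstream
     *  code can ask for "A" regardless of which source the line came from,
     *  and "D" is simply absent for lines from the first source. */
    public static Map<String, String> parse(String line, boolean pipeDelimited) {
        String[] cols = pipeDelimited ? SOURCE2_COLS : SOURCE1_COLS;
        String[] fields = line.split(pipeDelimited ? "\\|" : "\t");
        Map<String, String> record = new HashMap<String, String>();
        for (int i = 0; i < cols.length && i < fields.length; i++) {
            record.put(cols[i], fields[i]);
        }
        return record;
    }
}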

   Where I've been looking so far is SequenceFiles.  Since there isn't a logical key, my thought was to just use a line number as the SequenceFile key, similar to the positional key the default input format generates, although that feels a little weird.  For the value, since I want somewhat arbitrary key/value pairs per record, my thought was to store something like a serialized HashMap.
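    For concreteness, here's roughly what I'm picturing for writing the pre-processed output (untested; I've reached for Hadoop's MapWritable rather than hand-serializing a java.util.HashMap, since it already implements Writable):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class NormalizedLogWriter {

    /** Write normalized records to a SequenceFile, keyed by line number
     *  since the data has no natural key. */
    public static void write(Path out, Iterable<Map<String, String>> records)
            throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, LongWritable.class, MapWritable.class);
        try {
            long lineNo = 0;
            for (Map<String, String> record : records) {
                // Copy each column-name -> value pair into a MapWritable.
                MapWritable value = new MapWritable();
                for (Map.Entry<String, String> e : record.entrySet()) {
                    value.put(new Text(e.getKey()), new Text(e.getValue()));
                }
                writer.append(new LongWritable(lineNo++), value);
            }
        } finally {
            writer.close();
        }
    }
}

    In the mapper, each value would then come back as a MapWritable, and I could check for the presence of the "D" key to handle records from the second log type.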

    Any thoughts on whether I'm re-inventing the wheel here or heading off in a strange direction?

    Thanks

    Andy