Posted to common-user@hadoop.apache.org by Paul Ho <ph...@walmart.com> on 2012/01/11 02:56:56 UTC

getting file position for a LZO file

Hi all,

With TextInputFormat, the input key is the file position (byte offset) of each line, and this works well. But when I switch to LzoTextInputFormat to read LZO files, the key no longer makes sense: it does not indicate the line's file position. Is the file position supported with LzoTextInputFormat?

Here is a job that prints out file position and line.

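import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Assuming the hadoop-lzo package layout for the new (mapreduce) API.
import com.hadoop.mapreduce.LzoTextInputFormat;

public class Test {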

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text> {

        private Text outputValue = new Text();

        /*
         *  Outputs key,value pair.
         *    key = offset
         *    value = string
         */
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String s = value.toString();
            if (s.length() > 64) {
                s = s.substring(0, 64);
            }
            this.outputValue.set(s);
            context.write(key, this.outputValue);
        }

    }

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();

        Job j = new Job(c, "Test");

        j.setJarByClass(Test.class);

        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, new Path(args[1]));

        j.setMapperClass(Map.class);

        j.setInputFormatClass(LzoTextInputFormat.class);
        j.setOutputFormatClass(TextOutputFormat.class);

        j.setMapOutputKeyClass(LongWritable.class);
        j.setMapOutputValueClass(Text.class);

        j.setOutputKeyClass(LongWritable.class);
        j.setOutputValueClass(Text.class);

        if (!j.waitForCompletion(true)) {
            System.exit(1);
        }
    }

}


The output is:

0	[WEB.WWW.WARNING.30000][Mon 2012/01/09 14:00:00:933 PST][com.wm.
101200	=DynamicItem to String MethodDynamicItem{id=15762417, timestamp=
101200	{
101200	2012-01-09 14:16:19:195 - TP-Processor2, 29718094 -> L2 STRAND B
101200	2012-01-09 14:16:19:192 - TP-Processor2, 29718094 -> hostName=ed
101200	2012-01-09 14:16:19:186 - pool-113-thread-2, 11661605 -> hostNam
101200	SESSION FILTER BENCH: pre-process 0 millis <SessionID: 000000086
101200	TOMCAT REQ: /ip/Archangels-Chessmen/17703726 Mon Jan 09 14:16:19
101200	TIMESTAMP: Mon Jan 9 14:16:11 PST 2012
101200	TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200	
101200	[WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200	
101200	[WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200	TOMCAT REQ: /verify.gsp?novisitor=true&noses=true Mon Jan 09 14:
101200	TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200	
101200	[WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:03:767 PST][com.
...

The key does change, but it does not make sense to me: many consecutive lines share the same value (101200), so it cannot be the position of each line. Is there any way to get the file position of a line so I can print out that line later?
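
To illustrate what I am after: with plain (uncompressed) text, the key is the byte offset where a line starts, so seeking to that offset later recovers the line. Below is a minimal sketch of that idea in plain Java, without Hadoop. The class and method names (LineOffsetSketch, lineStartOffsets, readLineAt) are made up for illustration, and it assumes '\n' line endings and ASCII content.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of Hadoop or hadoop-lzo).
public class LineOffsetSketch {

    /** Returns the byte offset at which each line starts, assuming '\n' line endings. */
    public static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                offsets.add(start);
                start = i + 1L;
            }
        }
        if (start < data.length) {
            offsets.add(start); // last line has no trailing newline
        }
        return offsets;
    }

    /** Seeks to a byte offset in an uncompressed file and reads one line from there. */
    public static String readLineAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            // readLine() maps each byte to one char, so this assumes ASCII content.
            return raf.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("offsets", ".txt");
        try {
            byte[] data = "first line\nsecond line\nthird line\n"
                    .getBytes(StandardCharsets.US_ASCII);
            Files.write(tmp, data);
            // Print each offset with the line recovered by seeking to it.
            for (long offset : lineStartOffsets(data)) {
                System.out.println(offset + "\t" + readLineAt(tmp, offset));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This is the behaviour I would like to get with LZO input: a key per line that I can use later to find and print that same line.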

Any help would be appreciated!

Thanks!