Posted to common-user@hadoop.apache.org by Paul Ho <ph...@walmart.com> on 2012/01/11 02:56:56 UTC
getting file position for a LZO file
Hi all,
For the TextInputFormat class, the input key is the file position (byte offset) of each line, and this works well. But when I switch to LzoTextInputFormat to read LZO files, the key no longer makes sense: it does not indicate the line's position in the file. Is the file position supported with LzoTextInputFormat?
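For comparison, here is a minimal sketch of the keys one would expect for uncompressed text: the byte offset at which each line starts. It is plain Java with no Hadoop dependency; the class name LineOffsets and the method lineStartOffsets are hypothetical and only illustrate the expected numbering.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists the byte offset at which
// each line of an uncompressed text buffer starts.
class LineOffsets {
    static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        if (data.length == 0) {
            return offsets;
        }
        offsets.add(0L); // the first line always starts at offset 0
        for (int i = 0; i < data.length - 1; i++) {
            if (data[i] == '\n') {
                offsets.add((long) (i + 1)); // next line starts right after '\n'
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(lineStartOffsets(data)); // prints [0, 6, 13]
    }
}
```

Every line gets its own, strictly increasing key, which is the behaviour the LZO output further down does not show.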
Here is a job that prints out file position and line.
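import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class Test {
    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text> {
        private Text outputValue = new Text();

        /*
         * Outputs key,value pair.
         * key = offset
         * value = string
         */
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String s = value.toString();
            // Keep only the first 64 characters so the output stays readable.
            if (s.length() > 64) {
                s = s.substring(0, 64);
            }
            this.outputValue.set(s);
            context.write(key, this.outputValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        Job j = new Job(c, "Test");
        j.setJarByClass(Test.class);
        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, new Path(args[1]));
        j.setMapperClass(Map.class);
        // Switching this to TextInputFormat gives the expected per-line offsets.
        j.setInputFormatClass(LzoTextInputFormat.class);
        j.setOutputFormatClass(TextOutputFormat.class);
        j.setMapOutputKeyClass(LongWritable.class);
        j.setMapOutputValueClass(Text.class);
        j.setOutputKeyClass(LongWritable.class);
        j.setOutputValueClass(Text.class);
        if (!j.waitForCompletion(true)) {
            System.exit(1);
        }
    }
}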
The output is:
0 [WEB.WWW.WARNING.30000][Mon 2012/01/09 14:00:00:933 PST][com.wm.
101200 =DynamicItem to String MethodDynamicItem{id=15762417, timestamp=
101200 {
101200 2012-01-09 14:16:19:195 - TP-Processor2, 29718094 -> L2 STRAND B
101200 2012-01-09 14:16:19:192 - TP-Processor2, 29718094 -> hostName=ed
101200 2012-01-09 14:16:19:186 - pool-113-thread-2, 11661605 -> hostNam
101200 SESSION FILTER BENCH: pre-process 0 millis <SessionID: 000000086
101200 TOMCAT REQ: /ip/Archangels-Chessmen/17703726 Mon Jan 09 14:16:19
101200 TIMESTAMP: Mon Jan 9 14:16:11 PST 2012
101200 TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200
101200 [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200
101200 [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200 TOMCAT REQ: /verify.gsp?novisitor=true&noses=true Mon Jan 09 14:
101200 TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200
101200 [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:03:767 PST][com.
...
The key does change, but it does not make sense to me: after the first line, every line shown above has the same key, 101200. Is there any way to get the file position of a line so I can print out that line later?
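To make the goal concrete, the following is a minimal sketch of such a later lookup, assuming a local, uncompressed copy of the file and a key that is the byte offset of the line start. The class OffsetLookup and the method readLineAt are hypothetical names; the sketch does not address the LZO case, which is exactly the open question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: returns the line that starts at a given byte offset
// of a local, uncompressed text file.
class OffsetLookup {
    static String readLineAt(String path, long offset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset); // jump straight to the recorded position
            // readLine() maps each byte to one char, so re-decode as UTF-8.
            String raw = file.readLine();
            if (raw == null) {
                return null; // offset is at or past the end of the file
            }
            return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OffsetLookup <file> <offset>
        System.out.println(readLineAt(args[0], Long.parseLong(args[1])));
    }
}
```

With keys like the ones in the output above, where many lines share the value 101200, a lookup of this kind cannot tell the lines apart.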
Any help would be appreciated!
Thanks!