Posted to common-user@hadoop.apache.org by Peter Cogan <pe...@gmail.com> on 2013/01/11 20:13:46 UTC
Simple map-only job to create Block Sequence Files compressed with Snappy
Hi there,
I am trying to write a map-only job that takes some log files as input
and simply converts them into sequence files compressed with Snappy.
Although the job runs without error, the output file it creates is
pretty much the same size as the file I started with, so no compression
seems to be happening. Really confused!
I've pasted the full script and the hadoop output below.
The only output file is named part-m-00000 - since there is no reduce
phase this is the map output written straight to HDFS, and it is
essentially the same size as the input file.
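In case it helps with diagnosis, this is roughly how I've been inspecting the result. As I understand it, a SequenceFile header records the key/value class names and, when compression is on, the codec class name, so the codec should be visible near the start of the file (the "out" directory here is just a placeholder for my output path):

```shell
# Dump the first few hundred bytes of the SequenceFile header.
# A Snappy-compressed file should name org.apache.hadoop.io.compress.SnappyCodec
# in its header; an uncompressed one will not.
hadoop fs -cat out/part-m-00000 | head -c 300 | strings
```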
thanks!
Peter
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class snappyMapOutput {

    // Identity mapper: pass each (offset, line) pair straight through.
    // The input key type must be LongWritable in both the class generics and
    // the map() signature; with mismatched types the method is an overload,
    // not an override, and the default identity mapper runs instead.
    public static class MapFunction
            extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();

        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        conf.set("mapred.output.compression.type", "BLOCK");

        Job job = new Job(conf,
                "Convert to BLOCK Sequence File Snappy Compressed");
        job.setJarByClass(snappyMapOutput.class);
        job.setMapperClass(MapFunction.class);
        job.setNumReduceTasks(0);  // map-only job
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
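One thing I noticed while digging around: the properties I'm setting above are documented as controlling compression of the *intermediate* map output (the data normally handed to reducers), whereas the final job output appears to be governed by a separate set of properties. I'm not sure whether that is my problem, but for reference this is the alternative configuration I mean - a fragment assuming the same "conf" and "job" objects as in main() above, not yet tested:

```java
// Final (job) output compression - distinct from mapred.compress.map.output,
// which only affects the intermediate data between map and reduce.
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapred.output.compression.type", "BLOCK");

// Or, equivalently, via the new-API static helpers:
// FileOutputFormat.setCompressOutput(job, true);
// FileOutputFormat.setOutputCompressionClass(job, SnappyCodec.class);
// SequenceFileOutputFormat.setOutputCompressionType(job,
//         SequenceFile.CompressionType.BLOCK);
```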
13/01/11 19:19:38 INFO input.FileInputFormat: Total input paths to process : 1
13/01/11 19:19:38 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/01/11 19:19:38 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/01/11 19:19:38 WARN snappy.LoadSnappy: Snappy native library is available
13/01/11 19:19:38 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/11 19:19:38 INFO snappy.LoadSnappy: Snappy native library loaded
13/01/11 19:19:39 INFO mapred.JobClient: Running job: job_201301111838_0006
13/01/11 19:19:40 INFO mapred.JobClient:  map 0% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient:  map 100% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient: Job complete: job_201301111838_0006
13/01/11 19:19:45 INFO mapred.JobClient: Counters: 19
13/01/11 19:19:45 INFO mapred.JobClient:   Job Counters
13/01/11 19:19:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4566
13/01/11 19:19:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient:     Launched map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient:     Data-local map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/01/11 19:19:45 INFO mapred.JobClient:   File Output Format Counters
13/01/11 19:19:45 INFO mapred.JobClient:     Bytes Written=72951075
13/01/11 19:19:45 INFO mapred.JobClient:   FileSystemCounters
13/01/11 19:19:45 INFO mapred.JobClient:     HDFS_BYTES_READ=70983803
13/01/11 19:19:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=24107
13/01/11 19:19:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=72951075
13/01/11 19:19:45 INFO mapred.JobClient:   File Input Format Counters
13/01/11 19:19:45 INFO mapred.JobClient:     Bytes Read=70983680
13/01/11 19:19:45 INFO mapred.JobClient:   Map-Reduce Framework
13/01/11 19:19:45 INFO mapred.JobClient:     Map input records=79756
13/01/11 19:19:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=109174784
13/01/11 19:19:45 INFO mapred.JobClient:     Spilled Records=0
13/01/11 19:19:45 INFO mapred.JobClient:     CPU time spent (ms)=2040
13/01/11 19:19:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=187105280
13/01/11 19:19:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1084190720
13/01/11 19:19:45 INFO mapred.JobClient:     Map output records=79756
13/01/11 19:19:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=123