Posted to hdfs-user@hadoop.apache.org by Peter Cogan <pe...@gmail.com> on 2013/01/11 20:13:46 UTC

Simple map-only job to create Block Sequence Files compressed with Snappy

Hi there,

I am trying to create a map-only job that takes some log files as input and
simply converts them into sequence files compressed with Snappy.

Although the job runs with no error, the output file that is created is
pretty much the same size as the file I started with. Really confused!

I've pasted the full script and the Hadoop output below.

The only output file is named part-m-00000 - this is the resultant map
output file, and it appears to be the same size as the input file.
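One way to check what actually got written (the path below is a placeholder for the job's output directory): a sequence file begins with a small header that records the key/value class names and, when compression is enabled, the codec class, so peeking at the first few hundred bytes shows whether any codec was recorded at all.

```shell
# Peek at the sequence file header; "output" stands in for the real
# output directory. If block compression with Snappy had taken effect,
# org.apache.hadoop.io.compress.SnappyCodec would appear in this header.
hadoop fs -cat output/part-m-00000 | head -c 300
```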

thanks!
Peter






import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class snappyMapOutput {

    // Identity mapper: passes each (offset, line) pair straight through.
    // Note the input key type must be LongWritable (not Object) for this
    // map() to actually override Mapper.map().
    public static class MapFunction
            extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        conf.set("mapred.output.compression.type", "BLOCK");

        Job job = new Job(conf, "Convert to BLOCK Sequence File Snappy Compressed");
        job.setJarByClass(snappyMapOutput.class);
        job.setMapperClass(MapFunction.class);
        job.setNumReduceTasks(0);

        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
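For what it's worth, my understanding is that mapred.compress.map.output and mapred.map.output.compression.codec only govern the intermediate map output that would feed a shuffle/reduce phase; in a map-only job, the files the output format writes are controlled by the separate job-output compression settings. A minimal sketch of those settings, assuming the old Hadoop 1.x mapred.* property names and the conf/job variables from the code above:

```java
// Sketch only: enable compression of the *job output*, not just the
// intermediate map output. Property names are from the old 1.x
// mapred.* namespace.
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapred.output.compression.type", "BLOCK");

// Equivalent helper calls on the new-API output format classes
// (needs imports of org.apache.hadoop.io.SequenceFile and
// org.apache.hadoop.io.compress.SnappyCodec):
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
```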



13/01/11 19:19:38 INFO input.FileInputFormat: Total input paths to process : 1
13/01/11 19:19:38 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/01/11 19:19:38 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/01/11 19:19:38 WARN snappy.LoadSnappy: Snappy native library is available
13/01/11 19:19:38 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/11 19:19:38 INFO snappy.LoadSnappy: Snappy native library loaded
13/01/11 19:19:39 INFO mapred.JobClient: Running job: job_201301111838_0006
13/01/11 19:19:40 INFO mapred.JobClient:  map 0% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient:  map 100% reduce 0%
13/01/11 19:19:45 INFO mapred.JobClient: Job complete: job_201301111838_0006
13/01/11 19:19:45 INFO mapred.JobClient: Counters: 19
13/01/11 19:19:45 INFO mapred.JobClient:   Job Counters
13/01/11 19:19:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4566
13/01/11 19:19:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/11 19:19:45 INFO mapred.JobClient:     Launched map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient:     Data-local map tasks=1
13/01/11 19:19:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/01/11 19:19:45 INFO mapred.JobClient:   File Output Format Counters
13/01/11 19:19:45 INFO mapred.JobClient:     Bytes Written=72951075
13/01/11 19:19:45 INFO mapred.JobClient:   FileSystemCounters
13/01/11 19:19:45 INFO mapred.JobClient:     HDFS_BYTES_READ=70983803
13/01/11 19:19:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=24107
13/01/11 19:19:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=72951075
13/01/11 19:19:45 INFO mapred.JobClient:   File Input Format Counters
13/01/11 19:19:45 INFO mapred.JobClient:     Bytes Read=70983680
13/01/11 19:19:45 INFO mapred.JobClient:   Map-Reduce Framework
13/01/11 19:19:45 INFO mapred.JobClient:     Map input records=79756
13/01/11 19:19:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=109174784
13/01/11 19:19:45 INFO mapred.JobClient:     Spilled Records=0
13/01/11 19:19:45 INFO mapred.JobClient:     CPU time spent (ms)=2040
13/01/11 19:19:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=187105280
13/01/11 19:19:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1084190720
13/01/11 19:19:45 INFO mapred.JobClient:     Map output records=79756
13/01/11 19:19:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=123