Posted to common-user@hadoop.apache.org by Vasilis Liaskovitis <vl...@gmail.com> on 2010/01/28 00:39:16 UTC

verifying that lzo compression is being used

I am trying to use lzo for intermediate map output compression and gzip for
final output compression in my hadoop-0.20.1 jobs. For lzo, I've compiled
the jar and the JNI/native library from
http://code.google.com/p/hadoop-gpl-compression/ (version 0.1.0), and I am
also using the native lzo library v2.03.

Is there an easy way to verify that lzo compression is indeed being used?
My hadoop job output mentions loading both native lzo and native zlib:

10/01/27 22:53:24 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
10/01/27 22:53:24 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library
10/01/27 22:53:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
10/01/27 22:53:25 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
10/01/27 22:53:25 INFO compress.CodecPool: Got brand-new compressor
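
To take the MapReduce framework out of the picture, I also put together a
small standalone check. This is only a sketch (the class name and sample
input are mine), and it assumes the hadoop-gpl-compression jar is on the
classpath and libgplcompression.so is on java.library.path:

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class LzoSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Instantiate the codec by name, the same way the framework does.
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
        Class.forName("com.hadoop.compression.lzo.LzoCodec"), conf);
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // As far as I can tell from the hadoop-gpl-compression sources,
    // createOutputStream() throws "native-lzo library not available"
    // when the native library did not load, so getting past this line
    // means native lzo is really being used.
    CompressionOutputStream out = codec.createOutputStream(sink);
    out.write("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".getBytes());
    out.close();
    System.out.println(codec.getClass().getName() + " wrote "
        + sink.size() + " compressed bytes");
  }
}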

However, I am a bit suspicious as to whether lzo is actually being used. I've
been doing some CPU profiling (using oprofile) during jobs, and I don't see
any CPU samples coming from lzo-related native or JNI symbols; e.g. my
system's liblzo.so and libgplcompression.so do not appear. I do, however, see
a lot of samples coming from libz.so and libzip.so symbols. This profiling
makes me think that I may not actually be using the compressor I want to use.
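
Short of oprofile, the only other check I could think of is to look at which
shared objects are actually mapped into the JVM. This is a Linux-specific
sketch (the class name is mine; /proc is available on SLES10) that can be run
standalone with the same classpath and java.library.path as a task:

import java.io.BufferedReader;
import java.io.FileReader;

public class LoadedNativeLibs {
  public static void main(String[] args) throws Exception {
    // Touching the codec class should trigger loading of
    // libgplcompression.so and, through it, the system lzo library.
    Class.forName("com.hadoop.compression.lzo.LzoCodec");

    // Print every compression-related shared object mapped into this JVM.
    BufferedReader maps = new BufferedReader(new FileReader("/proc/self/maps"));
    String line;
    while ((line = maps.readLine()) != null) {
      if (line.contains("lzo") || line.contains("gplcompression")
          || line.contains("libz")) {
        System.out.println(line);
      }
    }
    maps.close();
  }
}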

Are there any other indications, or hadoop/system logs, that would prove lzo
is being used? Is it possible that lzo is being overridden by zlib? Judging
from the job output, lzo is loaded first and zlib second.
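
For the final output, at least, the codec is recorded in the file itself when
the job writes SequenceFiles, so it should be possible to dump it after a run
(and a test run with LzoCodec as the output codec would prove the lzo path
end to end). A sketch, assuming the output really is a SequenceFile (the
class name is mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class ShowOutputCodec {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path(args[0]);  // e.g. a part-00000 output file
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), part, conf);
    // The SequenceFile header records whether and how the data is compressed.
    System.out.println("compressed:       " + reader.isCompressed());
    System.out.println("block compressed: " + reader.isBlockCompressed());
    System.out.println("codec:            "
        + (reader.getCompressionCodec() == null ? "none"
           : reader.getCompressionCodec().getClass().getName()));
    reader.close();
  }
}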

These are the relevant hadoop configs in my mapred-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Should the outputs of the maps be compressed before being
               sent across the network. Uses SequenceFile compression.
  </description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to be compressed as SequenceFiles, how
               should they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>
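
One thing I notice while re-reading the configs:
mapreduce.map.output.compress.codec looks like a new-style property name, and
I am not sure hadoop-0.20.1 reads it at all (0.20 seems to want
mapred.map.output.compression.codec), in which case the map side would
silently fall back to the default zlib codec. To take property names out of
the equation, I may try setting the codecs programmatically through the old
API instead; a minimal sketch (the class and method names are mine):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

import com.hadoop.compression.lzo.LzoCodec;

public class CompressionSetup {
  public static JobConf configureCompression(JobConf conf) {
    // lzo for the intermediate map output...
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(LzoCodec.class);
    // ...and gzip for the final job output.
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    return conf;
  }
}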

I am including the JNI/native interface library libgplcompression.so in the
java.library.path of my child map/reduce tasks (through the child JVM option
-Djava.library.path); i.e. I am not using the jira-2838 or jira-5981 patches.
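
That is, something along these lines in mapred-site.xml (the library path
below is a placeholder for wherever libgplcompression.so is installed):

<property>
  <name>mapred.child.java.opts</name>
  <!-- /path/to/native/libs is a placeholder -->
  <value>-Xmx200m -Djava.library.path=/path/to/native/libs</value>
</property>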

Also, I am specifying gzip for final output compression, not zlib; shouldn't
something like libgz.so show up in my profiling rather than libz.so? Or does
GzipCodec sit on top of zlib, so that libz.so samples are expected anyway?
(Btw, my oprofile and hadoop tasks are set to use a JVMTI agent suitable for
Java profiling.)

The tests are being run on a SLES10 cluster.
Thanks for any help,

- Vasilis