Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2016/12/06 09:30:28 UTC

Hadoop compression on Nutch segments

Hi,

has anyone tried to use one of Hadoop's CompressionCodec implementations for
Nutch segments?  It could be worth using a codec other than DefaultCodec,
mainly Bzip2, not only because it uses less storage, but also because copying
larger data replicas around the cluster is sometimes slower and more expensive
than the extra CPU time required for tighter compression.  In my case, it's
about uploading the segments to AWS S3, which is also sometimes slow.

It turned out that the value of the property
 mapreduce.output.fileoutputformat.compress.codec
is not used when writing the segments - DefaultCodec is always used.
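For reference, below is a minimal sketch of how output compression is normally
configured per job, using the new mapreduce API (Nutch's own jobs may still use
the older mapred API, so the exact calls could differ; the class name is just a
placeholder).  The setters correspond to the properties
mapreduce.output.fileoutputformat.compress, ...compress.codec and
...compress.type:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.compress.BZip2Codec;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  public class CompressionConfigSketch {
    public static Job configure(Configuration conf) throws Exception {
      Job job = Job.getInstance(conf, "compression-config-sketch");
      // mapreduce.output.fileoutputformat.compress = true
      FileOutputFormat.setCompressOutput(job, true);
      // mapreduce.output.fileoutputformat.compress.codec = ...BZip2Codec
      FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
      // mapreduce.output.fileoutputformat.compress.type = BLOCK
      SequenceFileOutputFormat.setOutputCompressionType(job,
          CompressionType.BLOCK);
      return job;
    }
  }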

After passing the compress.codec to the segment writers
(see https://github.com/sebastian-nagel/nutch/tree/segment-compression-codec),
a test on a small and maybe not representative sample showed that BZip2Codec
with BLOCK compression reduces the size of the content subdir by about 30%.
The other subdirs are either too small in the sample to give relevant results
or are always compressed per RECORD (parse_text).
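Conceptually, the change amounts to something like the sketch below (class name
and Text/Text key/value types are placeholders, not the actual code in the
branch): the writer reads the codec class from the standard property instead of
hard-coding DefaultCodec, and opens the SequenceFile with BLOCK compression:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.DefaultCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SegmentWriterSketch {
    public static SequenceFile.Writer open(Configuration conf, Path out)
        throws IOException {
      // pick up the configured codec, falling back to DefaultCodec
      Class<? extends CompressionCodec> codecClass = conf.getClass(
          "mapreduce.output.fileoutputformat.compress.codec",
          DefaultCodec.class, CompressionCodec.class);
      CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
      // Text/Text is a placeholder; the real segment writers use the
      // Nutch key/value classes (e.g. Text/Content for the content subdir)
      return SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(out),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(Text.class),
          SequenceFile.Writer.compression(CompressionType.BLOCK, codec));
    }
  }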

Before running a larger test, I would like to hear about your experiences.

Thanks,
Sebastian