Posted to hdfs-user@hadoop.apache.org by Tim Robertson <ti...@gmail.com> on 2014/09/19 17:35:15 UTC

Merging deflated files into a zip efficiently - has it been done?

Hi all,

I need to generate Zip (yes really) files and I am looking to do this as
efficiently as possible.  The zips will hold the output of Hive queries,
but the Hive bit is irrelevant - this is a straight MR problem.

Ideally I'd compress text files in MR world and then merge them into a Zip
on the way out of the cluster, so that a) the reducers are compressing
blocks in parallel, b) data coming out of Hadoop is compressed and therefore
bandwidth efficient, and c) I can simply merge the compressed data on the
way out of HDFS, avoiding the single bottleneck normally associated with
Zip.

I notice the default compression codec is Deflate-based, but it writes
headers etc. into the .deflate file.  To merge Deflate streams into a
Zip you need a few things:
  a) the length of the uncompressed data
  b) the length of the compressed data
  c) the CRC-32 of the uncompressed data

...and the deflated content needs to have been created headerless (e.g. with
the no wrap option, and in SYNC_FLUSH mode).
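
To make the requirements above concrete, here is a minimal sketch in Python
(stdlib only), not a tested Hadoop solution.  It assumes each part becomes
its own zip entry, which avoids having to combine CRC-32 values across
parts; merging several parts into a single entry would need an extra CRC
combine step that is not shown.  The names deflate_part, write_zip and
part-0000N.txt are made up for illustration, and the negative wbits value
is zlib's way of asking for a raw (headerless) stream.  It also assumes
each part is finished normally rather than with a sync flush, and that
everything stays under the 4 GiB / 65535-entry limits of plain (non-Zip64)
zips.

```python
import io
import struct
import zipfile
import zlib


def deflate_part(data, level=6):
    """Return (raw_deflate_bytes, crc32, uncompressed_length) for one part."""
    # Negative wbits = raw deflate: no zlib header and no Adler-32 trailer.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    raw = comp.compress(data) + comp.flush()
    return raw, zlib.crc32(data) & 0xFFFFFFFF, len(data)


def write_zip(out, parts):
    """Write already-deflated parts to `out` as zip entries, no recompression.

    `parts` is a list of (name, raw_deflate_bytes, crc32, uncompressed_length).
    """
    dos_time, dos_date = 0, (1 << 5) | 1  # fixed timestamp: 1980-01-01 00:00
    central = []
    offset = 0
    for name, raw, crc, usize in parts:
        fname = name.encode("ascii")
        # Local file header: version 2.0, no flags, method 8 (deflate).
        header = struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, 0, 8,
                             dos_time, dos_date, crc, len(raw), usize,
                             len(fname), 0)
        out.write(header + fname + raw)
        # Matching central directory record, pointing back at the local header.
        central.append(struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20,
                                   0, 8, dos_time, dos_date, crc, len(raw),
                                   usize, len(fname), 0, 0, 0, 0, 0, offset)
                       + fname)
        offset += len(header) + len(fname) + len(raw)
    directory = b"".join(central)
    out.write(directory)
    # End-of-central-directory record; `offset` is where the directory starts.
    out.write(struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(parts),
                          len(parts), len(directory), offset, 0))


# Stand-ins for reducer outputs: compress each part independently...
texts = [b"a\tb\n" * 1000, b"c\td\n" * 1000]
parts = []
for i, text in enumerate(texts):
    raw, crc, usize = deflate_part(text)
    parts.append(("part-%05d.txt" % i, raw, crc, usize))

# ...then stitch the compressed bytes together on the way out.
buf = io.BytesIO()
write_zip(buf, parts)

# Read back with the standard zipfile module as a sanity check.
archive = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
bad_entry = archive.testzip()  # None when every entry's CRC checks out
```

The merge step only copies bytes and writes small fixed-size records, so
the three values listed above are all it needs from each part.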

Has anyone here tackled this problem before, or seen it done?  Or does
anyone have any tricks for getting Zips out of text data in HDFS
efficiently?

Thanks,
Tim