Posted to common-user@hadoop.apache.org by Robert Fynes <fy...@gmail.com> on 2013/04/02 17:38:44 UTC

Re: Using Hadoop for codec functionality

Thanks for both your responses. I was indeed talking about developing a
codec utility as the Hadoop application itself.

In particular, thanks to Bertrand for the lengthy response. I'm actually
learning Hadoop at the moment, so I've been trying to find a suitable (very
modestly sized) application for a student project (1-2 weeks max).
I had previously written a codec utility in Perl that uses a combination of
dictionary (LZW) and arithmetic coding techniques. The compression ratios
aren't bad, but it's very slow.

In any case, I just thought that it might be interesting to Hadoop-ify the
program since compression/decompression is compute intensive and could
probably benefit from parallelization.
I'm thinking now that it might not be such a good fit after all.

Also, if anyone reading this has any novel ideas for demonstrating Hadoop's
capabilities within a short development window, I'd love to hear about
them.
At the moment, I'm leaning towards a distributed grep, most likely with
some kind of agrep-like functionality. Not really a searingly inventive
idea, but if anyone can suggest some way I could make it more exciting, I'd
love to hear about that too.
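
Roughly what I have in mind, as a minimal (untested) sketch; the
"grep.pattern" configuration key and the class name are just ones I picked:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every input line that matches a regex taken from the job config.
public class GrepMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private Pattern pattern;
  private final LongWritable one = new LongWritable(1);

  @Override
  protected void setup(Context context) {
    pattern = Pattern.compile(
        context.getConfiguration().get("grep.pattern", ".*"));
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    Matcher m = pattern.matcher(line.toString());
    if (m.find()) {
      // A reducer could then count how often each matching line occurs.
      context.write(line, one);
    }
  }
}

An agrep-like variant could replace the regex match with an approximate
(edit-distance) match against the search term.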

-Rob


On 31 March 2013 10:38, Bertrand Dechoux <de...@gmail.com> wrote:

> Your question could be interpreted in another way: should I use Hadoop
> in order to perform massive compression/decompression using my own
> (possibly proprietary) utility?
>
> So yes, Hadoop can be used to parallelize the work. But the real answer
> will depend on your context, as always.
> How many files need to be processed? What is the average size? Is your
> utility parallelizable? How will the data be used after
> compression/decompression?
>
> The number of files and their size is important because Hadoop is
> designed to deal with a relatively low number of relatively big files:
> a few million gigabyte-sized files rather than billions of
> megabyte-sized files. Many small files can become a performance issue.
> But a huge file is not necessarily better either, because if your
> utility is not parallelizable then, regardless of Hadoop, decompressing
> a 2GB file requires a single process to read the whole file, and the
> uncompressed version then needs to be stored somewhere.
>
> So the final question is: for what purpose? If it is for massive
> decompression, keeping the compressed version inside Hadoop seems a
> sane strategy, so it might be better to rely on a standard compression
> utility and decompress only just before processing inside Hadoop
> itself. If it is for compression, well, it might not be that massive,
> because you might not receive that many files at the same time.
>
> The common strategy in Hadoop is not to compress a whole file but
> instead to compress its parts (blocks). This way the size of each unit
> of compression work is bounded and the work can be parallelized even
> with a non-parallelizable compression utility. The drawback is that the
> "list of compressed blocks" is not a standard compressed file, so
> interoperability with other parts of your system is not guaranteed
> without extra work.
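>
> To make that concrete, here is a rough, untested sketch of a job driver
> that writes block-compressed SequenceFile output. The codec choice and
> the class/job names are only examples:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.compress.DefaultCodec;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>
> public class BlockCompressedOutputDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Job job = Job.getInstance(conf, "block-compressed output");
>
>     // Write block-compressed SequenceFiles: each block is compressed
>     // independently, so later jobs can split the work even though the
>     // codec itself is not parallelizable.
>     job.setOutputFormatClass(SequenceFileOutputFormat.class);
>     FileOutputFormat.setCompressOutput(job, true);
>     FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
>     SequenceFileOutputFormat.setOutputCompressionType(job,
>         SequenceFile.CompressionType.BLOCK);
>
>     // ... mapper/reducer, key/value classes and input/output paths
>     // would be set here ...
>   }
> }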
>
> Bertrand
>
>
> On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
> jens.scheidtmann@gmail.com> wrote:
>
>> Dear Robert,
>>
>> SequenceFiles can use record, block, or no compression. You can
>> configure which codec (gzip, bzip2, etc.) is used. Have a look at
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
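>>
>> For example, here is a minimal (untested) sketch that writes a
>> block-compressed SequenceFile with an explicit codec; the path and the
>> key/value types are just placeholders:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.compress.BZip2Codec;
>> import org.apache.hadoop.io.compress.CompressionCodec;
>> import org.apache.hadoop.util.ReflectionUtils;
>>
>> public class SequenceFileWriteExample {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     Path path = new Path("/tmp/example.seq");  // placeholder path
>>
>>     // Block compression with bzip2; any available CompressionCodec
>>     // implementation can be plugged in here.
>>     CompressionCodec codec =
>>         ReflectionUtils.newInstance(BZip2Codec.class, conf);
>>     SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>>         path, IntWritable.class, Text.class,
>>         SequenceFile.CompressionType.BLOCK, codec);
>>     try {
>>       writer.append(new IntWritable(1), new Text("hello"));
>>     } finally {
>>       writer.close();
>>     }
>>   }
>> }
>>
>> SequenceFile.Reader decompresses the data transparently when the file
>> is read back.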
>>
>> Best regards,
>>
>> Jens
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>