Posted to user@hadoop.apache.org by Robert Fynes <fy...@gmail.com> on 2013/03/30 12:04:24 UTC

Using Hadoop for codec functionality

Hi,

I was wondering if anyone could comment on the suitability of using Hadoop
to run a custom file compression/decompression utility (with functionality
such as zip, gzip, bzip2 etc.).

Re: Using Hadoop for codec functionality

Posted by Robert Fynes <fy...@gmail.com>.
Thanks for both your responses. I was indeed talking about developing a
codec utility as the Hadoop application itself.

In particular, thanks to Bertrand for the lengthy response. I'm actually
learning Hadoop at the moment, so I've been trying to find a suitable (very
modestly sized) application for a student project (1-2 weeks max).
I had previously written a codec utility in Perl that uses a combination of
dictionary (LZW) and arithmetic coding techniques. The compression ratios
aren't that bad, but it's very slow.

In any case, I just thought that it might be interesting to Hadoop-ify the
program since compression/decompression is compute intensive and could
probably benefit from parallelization.
I'm thinking now that it might not be such a good fit after all.

Also, if anyone reading this has any novel ideas for demonstrating Hadoop's
capabilities within a short development window, I'd love to hear about
it.
At the moment, I'm leaning towards a distributed grep, most likely with
some kind of agrep-like functionality. Not really a searingly inventive
idea, but if anyone can suggest some way I could make it more exciting, I'd
love to hear about that too.
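
For concreteness, a minimal sketch of the mapper for such a distributed grep
(assuming the newer org.apache.hadoop.mapreduce API; the configuration key
"grep.pattern" is just a hypothetical name chosen here) could look like this:

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the input lines that match a regex taken from the job
    // configuration; each map task greps its own input split in parallel.
    public class GrepMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        pattern = Pattern.compile(
            context.getConfiguration().get("grep.pattern"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
          context.write(value, NullWritable.get()); // matching line as the key
        }
      }
    }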

-Rob


On 31 March 2013 10:38, Bertrand Dechoux <de...@gmail.com> wrote:

> Your question could be interpreted in another way: should I use Hadoop in
> order to perform massive compression/decompression using my own (possibly
> proprietary) utility?
>
> So yes, Hadoop can be used to parallelize the work. But the real answer
> will depend on your context, as always.
> How many files need to be processed? What is their average size? Is your
> utility parallelizable? How will the data be used after
> compression/decompression?
>
> The number of files and their size is important because Hadoop is designed
> to deal with a relatively small number of relatively big files: a few
> million gigabyte-sized files rather than billions of megabyte-sized files.
> Many small files can become a performance issue. But a huge file is not
> necessarily better either: if your utility is not parallelizable then,
> regardless of Hadoop, decompressing a 2 GB file requires a single process
> to read the whole file, and the uncompressed version then needs to be
> stored somewhere.
>
> So the final question is: for what purpose? If it is for massive
> decompression, keeping the compressed version inside Hadoop seems a sane
> strategy. So it might be better to rely on a standard compression utility
> and decompress only just before processing inside Hadoop itself. If it is
> for compression, well, it might not be that massive because you might not
> receive that many files at the same time.
>
> The common strategy in Hadoop is not to compress a whole file but instead
> to compress the parts (blocks) of the file. This way the size of the
> compression work is bounded and the work can be parallelized even with a
> non-parallelizable compression utility. The drawback is that the "list of
> compressed blocks" is not a standard compressed file, so interoperability
> with other parts of your system is not guaranteed without extra work.
>
> Bertrand
>
>
> On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
> jens.scheidtmann@gmail.com> wrote:
>
>> Dear Robert,
>>
>> SequenceFiles support record, block, or no compression. You can
>> configure which codec (gzip, bzip2, etc.) is used. Have a look at
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>>
>> Best regards,
>>
>> Jens
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: Using Hadoop for codec functionality

Posted by Bertrand Dechoux <de...@gmail.com>.
Your question could be interpreted in another way: should I use Hadoop in
order to perform massive compression/decompression using my own (possibly
proprietary) utility?

So yes, Hadoop can be used to parallelize the work. But the real answer
will depend on your context, as always.
How many files need to be processed? What is their average size? Is your
utility parallelizable? How will the data be used after
compression/decompression?

The number of files and their size is important because Hadoop is designed
to deal with a relatively small number of relatively big files: a few
million gigabyte-sized files rather than billions of megabyte-sized files.
Many small files can become a performance issue. But a huge file is not
necessarily better either: if your utility is not parallelizable then,
regardless of Hadoop, decompressing a 2 GB file requires a single process
to read the whole file, and the uncompressed version then needs to be
stored somewhere.

So the final question is: for what purpose? If it is for massive
decompression, keeping the compressed version inside Hadoop seems a sane
strategy. So it might be better to rely on a standard compression utility
and decompress only just before processing inside Hadoop itself. If it is
for compression, well, it might not be that massive because you might not
receive that many files at the same time.
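
For reference, reading a file compressed with a standard codec from inside a
Hadoop program usually goes through CompressionCodecFactory, which picks the
codec from the file extension. A rough sketch (the input path is hypothetical):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadCompressed {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/input.gz");            // hypothetical path
        // Infer the codec from the extension (.gz, .bz2, ...); null means
        // the file is not compressed with a registered codec.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream in = (codec == null)
            ? fs.open(path)
            : codec.createInputStream(fs.open(path));
        try {
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }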

The common strategy in Hadoop is not to compress a whole file but instead
to compress the parts (blocks) of the file. This way the size of the
compression work is bounded and the work can be parallelized even with a
non-parallelizable compression utility. The drawback is that the "list of
compressed blocks" is not a standard compressed file, so interoperability
with other parts of your system is not guaranteed without extra work.
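
As an illustration of that strategy, a job can be configured to write
block-compressed SequenceFile output, so each compressed block can later be
decompressed independently. A minimal driver sketch (assuming the Hadoop 2.x
API; on 1.x, new Job(conf, ...) plays the role of Job.getInstance, and the
class name and path arguments here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class BlockCompressedJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "block-compressed output");
        job.setJarByClass(BlockCompressedJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Write SequenceFiles whose blocks are compressed individually,
        // rather than one monolithic compressed file.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }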

Bertrand


On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
jens.scheidtmann@gmail.com> wrote:

> Dear Robert,
>
> SequenceFiles support record, block, or no compression. You can
> configure which codec (gzip, bzip2, etc.) is used. Have a look at
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>
> Best regards,
>
> Jens
>
>
>


-- 
Bertrand Dechoux

Re: Using Hadoop for codec functionality

Posted by Jens Scheidtmann <je...@gmail.com>.
Dear Robert,

SequenceFiles support record, block, or no compression. You can
configure which codec (gzip, bzip2, etc.) is used. Have a look at
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
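
For example, writing a block-compressed SequenceFile with an explicitly
chosen codec might look roughly like the sketch below (the output path and
the choice of gzip are arbitrary; this assumes the classic FileSystem-based
SequenceFile.createWriter signature):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqFileWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/example.seq");           // arbitrary path
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // BLOCK compression compresses batches of records together, which
        // usually compresses better than per-record (RECORD) compression.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, IntWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, codec);
        try {
          writer.append(new IntWritable(1), new Text("hello"));
        } finally {
          writer.close();
        }
      }
    }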

Best regards,

Jens



