Posted to mapreduce-user@hadoop.apache.org by Grzegorz Gunia <sa...@student.agh.edu.pl> on 2012/04/11 09:55:59 UTC

CompressionCodec in MapReduce

Hello,
I am trying to use a custom CompressionCodec with MapReduce 
jobs, but I haven't found a way to inject it during the reading of input 
data or during the writing of the job results.
Am I missing something, or is there no support for compressed files in 
the filesystem?

I am well aware of how to set it up for the intermediate 
phases of the MapReduce operation, but I just can't find a way to apply 
it BEFORE the job takes place...
Is there any other way besides simply uncompressing the files I need 
prior to scheduling a job?

Huge thanks for any help you can give me!
--
Greg

RE: CompressionCodec in MapReduce

Posted by Devaraj k <de...@huawei.com>.
Hi Grzegorz,

    You can use the properties below for job input and output compression.

The property below is used by the codec factory: the codec is chosen based on the file's type (i.e. its suffix). By default the LineRecordReader used by FileInputFormat relies on this. If you want to handle input compression in some other way, you can write an InputFormat accordingly.

core-site.xml:
---------------

<property> 
  <name>io.compression.codecs</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value> 
  <description>A list of the compression codec classes that can be used 
               for compression/decompression.</description> 
</property> 
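
For illustration, this is roughly what happens on the input side: the factory maps a file suffix to a codec from that list. A minimal sketch (the class name and input path below are just examples):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/input/part-00000.gz"); // example path
    // The factory picks the codec whose default extension matches the
    // file suffix, consulting the io.compression.codecs list.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream raw = fs.open(path);
    // getCodec returns null when no registered codec matches the suffix.
    InputStream in = (codec == null) ? raw : codec.createInputStream(raw);
    // ... read decompressed bytes from 'in' ...
    in.close();
  }
}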


   I am not sure which version of Hadoop you are using, so I am giving the properties for both newer and older versions. These are the properties you need to configure if you want to compress job outputs. They only work when the output format is a FileOutputFormat.

mapred-site.xml (for version 0.23 and later):
---------------------------------------------------

<property> 
  <name>mapreduce.output.fileoutputformat.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how should 
               they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapreduce.output.fileoutputformat.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be compressed? 
  </description> 
</property> 




mapred-site.xml (for older versions):
------------------------------------------

<property> 
  <name>mapred.output.compress</name> 
  <value>false</value> 
  <description>Should the job outputs be compressed? 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.type</name> 
  <value>RECORD</value> 
  <description>If the job outputs are to be compressed as SequenceFiles, how should 
               they be compressed? Should be one of NONE, RECORD or BLOCK. 
  </description> 
</property> 

<property> 
  <name>mapred.output.compression.codec</name> 
  <value>org.apache.hadoop.io.compress.DefaultCodec</value> 
  <description>If the job outputs are compressed, how should they be compressed? 
  </description> 
</property> 
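
The same settings can also be applied programmatically on the job, which is often more convenient than editing mapred-site.xml. A minimal sketch (the job name is arbitrary and GzipCodec is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "compressed-output");
    // Equivalent to mapreduce.output.fileoutputformat.compress=true
    FileOutputFormat.setCompressOutput(job, true);
    // Equivalent to mapreduce.output.fileoutputformat.compression.codec
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // Only relevant when the output is a SequenceFile (compression.type):
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
    // ... set mapper, reducer, input/output paths, then submit ...
  }
}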


If you want to use compression with your custom input and output formats, you can implement the compression in those classes.


Thanks
Devaraj

Re: CompressionCodec in MapReduce

Posted by Zizon Qiu <zz...@gmail.com>.
It is possible, but a little tricky.

As I mentioned before: write a custom InputFormat and the associated
RecordReader.
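
A rough sketch of the RecordReader half, in the new (mapreduce) API. The idea is to grab the file name in initialize() and make the per-file key available before the underlying reader instantiates the codec; lookupKeyFor and the "my.codec.key" property are hypothetical placeholders for whatever your codec actually needs:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class PerFileCodecRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate = new LineRecordReader();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // The split is a FileSplit, so the file name is available here.
    Path file = ((FileSplit) split).getPath();
    Configuration conf = context.getConfiguration();
    // Stash the per-file key where the codec can find it when it is
    // instantiated (hypothetical property and lookup).
    conf.set("my.codec.key", lookupKeyFor(file));
    delegate.initialize(split, context);
  }

  private static String lookupKeyFor(Path file) {
    // Hypothetical: resolve the key from a keystore by file name.
    return file.getName();
  }

  @Override public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }
  @Override public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
  @Override public Text getCurrentValue() { return delegate.getCurrentValue(); }
  @Override public float getProgress() throws IOException { return delegate.getProgress(); }
  @Override public void close() throws IOException { delegate.close(); }
}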


Re: CompressionCodec in MapReduce

Posted by Arun C Murthy <ac...@hortonworks.com>.
You can write your own InputFormat (IF) which extends FileInputFormat.

In your IF you get the InputSplit, which carries the filename, during the call to getRecordReader. That is the hook you are looking for.
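
A rough sketch of that hook in the old (mapred) API, extending TextInputFormat (which is a FileInputFormat); the class name and the configuration property for the key are hypothetical:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class PerFileInputFormat extends TextInputFormat {
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // The split is a FileSplit; this is where the file name becomes available.
    Path file = ((FileSplit) split).getPath();
    // Make the name (or a key derived from it) visible to the codec
    // before the reader opens the stream (hypothetical property).
    job.set("my.codec.key", file.getName());
    return super.getRecordReader(split, job, reporter);
  }
}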

More details here:
http://hadoop.apache.org/common/docs/r1.0.2/mapred_tutorial.html#Job+Input

hth,
Arun


--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: CompressionCodec in MapReduce

Posted by Grzegorz Gunia <sa...@student.agh.edu.pl>.
I think we misunderstood each other here.

I'll base my question on an example:
Let's say I want each of the files stored on my HDFS to be encrypted 
prior to being physically stored on the cluster.
For that I'll write a custom CompressionCodec that performs the 
encryption, and use it during any edits/creations of files in HDFS.
Then, to make it more secure, I'll make it use different keys for 
different files, and supply the keys to the codec during its instantiation.

Now I'd like to run a MapReduce job on those files. That would require 
instantiating the codec and supplying it with the filename to 
determine the key used. Is it possible to do so with the current 
implementation of Hadoop?

--
Greg



Re: CompressionCodec in MapReduce

Posted by Zizon Qiu <zz...@gmail.com>.
If you are:
1. using TextInputFormat,
2. all input files end with a certain suffix like ".gz", and
3. the custom CompressionCodec is already registered in the configuration
and its getDefaultExtension() returns the same suffix as described in 2,

then there is nothing else you need to do; Hadoop will deal with it
automatically.

That means the input key/value pairs passed to the map method are already
decompressed.

But if the original files do not end with such a suffix, you need to write
your own InputFormat, or subclass TextInputFormat and override the
createRecordReader method to return your own RecordReader.
The InputSplit passed to the InputFormat is actually a FileSplit, from which
you can retrieve the input file path.

You may also take a look at the isSplitable method declared
in FileInputFormat, if your files are not splittable.

For more detail, refer to the TextInputFormat class implementation.
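
For example, a minimal sketch of that subclass in the new (mapreduce) API. It could return a reader like the PerFileCodecRecordReader sketched earlier in this thread; both class names are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PerFileCodecInputFormat extends TextInputFormat {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The split is a FileSplit, so the input file path can be inspected
    // here to decide which reader (and which codec/key) to use.
    Path file = ((FileSplit) split).getPath();
    return new PerFileCodecRecordReader(); // hypothetical reader, see earlier sketch
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Whole-file encryption/compression usually means the file cannot be split.
    return false;
  }
}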


Re: CompressionCodec in MapReduce

Posted by Grzegorz Gunia <sa...@student.agh.edu.pl>.
Thanks for your reply! That clears some things up.
There is but one problem... My CompressionCodec has to be instantiated 
on a per-file basis, meaning it needs to know the name of the file it is 
to compress/decompress. I'm guessing that would not be possible with the 
current implementation?

Or if it is, how would I go about supplying it with the file name?
--
Greg



Re: CompressionCodec in MapReduce

Posted by Zizon Qiu <zz...@gmail.com>.
Append your custom codec's full class name to "io.compression.codecs", either
in mapred-site.xml or in the Configuration object passed to the Job constructor.

The MapReduce framework will try to guess the compression algorithm from the
input file's suffix.

If any CompressionCodec registered in the configuration has a
getDefaultExtension() matching that suffix, Hadoop will try to instantiate the
codec and, if that succeeds, decompress for you automatically.

The default value of "io.compression.codecs" is
"org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
