Posted to mapreduce-user@hadoop.apache.org by Grzegorz Gunia <sa...@student.agh.edu.pl> on 2012/04/11 09:55:59 UTC
CompressionCodec in MapReduce
Hello,
I am trying to apply a custom CompressionCodec to MapReduce
jobs, but I haven't found a way to inject it during the reading of the input
data, or during the writing of the job results.
Am I missing something, or is there no support for compressed files in
the filesystem?
I am well aware of how to set it up for the intermediate
phases of the MapReduce operation, but I just can't find a way to apply
it BEFORE the job takes place...
Is there any other way besides simply uncompressing the files I need
prior to scheduling a job?
Huge thanks for any help you can give me!
--
Greg
RE: CompressionCodec in MapReduce
Posted by Devaraj k <de...@huawei.com>.
Hi Grzegorz,
You can use the properties below for job input and output compression.
The following property is used by the codec factory. The codec is chosen based on the type (i.e. the suffix) of the file. By default this is used by the LineRecordReader, which FileInputFormat uses. If you want input compression handled in some other way, you can write an input format accordingly.
core-site.xml:
---------------
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
I am not sure which version of Hadoop you are using, so I am giving the properties for both newer and older versions. These are the properties you need to configure if you want to compress job outputs. They take effect only when the output format is a FileOutputFormat.
mapred-site.xml:(for version 0.23 and later)
---------------------------------------------------
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>false</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compression.type</name>
<value>RECORD</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compression.codec</name>
<value>org.apache.hadoop.io.compress.DefaultCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
mapred-site.xml:(for older versions)
------------------------------------------
<property>
<name>mapred.output.compress</name>
<value>false</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>RECORD</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.DefaultCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
If you want to use compression with your custom input and output formats, you can implement the compression in those classes.
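For the 0.23+ API, the same settings can also be made per job from code. This is a sketch equivalent to the XML properties above (the job name is arbitrary, and GzipCodec merely stands in for whatever codec you choose):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Per-job equivalent of the mapred-site.xml properties above:
Job job = Job.getInstance(new Configuration(), "compressed-output");

// mapreduce.output.fileoutputformat.compress
FileOutputFormat.setCompressOutput(job, true);
// mapreduce.output.fileoutputformat.compression.codec
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// mapreduce.output.fileoutputformat.compression.type
// (only relevant when the output is a SequenceFile)
SequenceFileOutputFormat.setOutputCompressionType(job,
    SequenceFile.CompressionType.BLOCK);
```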
Thanks
Devaraj
Re: CompressionCodec in MapReduce
Posted by Zizon Qiu <zz...@gmail.com>.
It is possible, but a little tricky.
As I mentioned before, write a custom InputFormat and the associated
RecordReader.
Re: CompressionCodec in MapReduce
Posted by Arun C Murthy <ac...@hortonworks.com>.
You can write your own InputFormat (IF) which extends FileInputFormat.
In your IF, you get the InputSplit, which has the filename, during the call to getRecordReader. That is the hook you are looking for.
More details here:
http://hadoop.apache.org/common/docs/r1.0.2/mapred_tutorial.html#Job+Input
hth,
Arun
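A minimal sketch of that hook against the newer (mapreduce.*) API, where the method is called createRecordReader, might look as follows. KeyedCodecInputFormat and lookupKeyFor are hypothetical names, and a real implementation would return a RecordReader that wraps the input stream with the keyed codec rather than a plain LineRecordReader:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: the split carries the file path, which is
// the hook for selecting a per-file codec/key.
public class KeyedCodecInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // The InputSplit handed to a FileInputFormat is a FileSplit:
        Path file = ((FileSplit) split).getPath();
        // Look up the per-file key before the reader opens the stream
        // (lookupKeyFor is a hypothetical helper you would supply):
        byte[] key = lookupKeyFor(file.getName());
        // A real implementation would wrap the stream with the keyed codec;
        // LineRecordReader is only a starting point.
        return new LineRecordReader();
    }

    private byte[] lookupKeyFor(String fileName) {
        return new byte[0]; // placeholder for your key-management scheme
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // whole-file codecs are generally not splittable
    }
}
```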
--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: CompressionCodec in MapReduce
Posted by Grzegorz Gunia <sa...@student.agh.edu.pl>.
I think we misunderstood each other here.
I'll base my question on an example:
Let's say I want each of the files stored in my HDFS to be encrypted
prior to being physically stored on the cluster.
For that, I'll write a custom CompressionCodec that performs the
encryption, and use it during any edits/creations of files in the HDFS.
Then, to make it more secure, I'll have it use different keys for
different files, and supply the keys to the codec during its instantiation.
Now I'd like to run a MapReduce job on those files. That would require
instantiating the codec and supplying it with the filename, to
determine the key used. Is that possible with the current
implementation of Hadoop?
--
Greg
Re: CompressionCodec in MapReduce
Posted by Zizon Qiu <zz...@gmail.com>.
If you are:
1. using TextInputFormat,
2. all input files end with a certain suffix like ".gz", and
3. the custom CompressionCodec is already registered in the configuration
and its getDefaultExtension() returns the same suffix as described in 2,
then there is nothing else you need to do;
Hadoop will deal with it automatically.
That means the input key and value in the map method are already decompressed.
But if the original files do not end with a certain suffix, you need to write
your own InputFormat, or subclass TextInputFormat and override the
createRecordReader method to return your own RecordReader.
The InputSplit passed to the InputFormat is actually a FileSplit, from which
you can retrieve the input file path.
You may also take a look at the isSplitable method declared
in FileInputFormat, if your files are not splittable.
For more detail, refer to the TextInputFormat class implementation.
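The suffix guessing described above can be illustrated with a small standalone sketch (plain Java, no Hadoop on the classpath; the class and method names are ours, loosely mirroring what CompressionCodecFactory.getCodec() does with each codec's getDefaultExtension()):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone illustration of suffix-based codec guessing: register a
// suffix per codec, then match file names against those suffixes.
class SuffixCodecGuess {
    private final Map<String, String> codecBySuffix = new LinkedHashMap<>();

    // suffix would come from the codec's getDefaultExtension()
    public void register(String suffix, String codecClassName) {
        codecBySuffix.put(suffix, codecClassName);
    }

    // Returns the codec registered for the file's suffix, or null if the
    // file should be read as-is (no matching extension). The longest
    // matching suffix wins.
    public String guess(String fileName) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : codecBySuffix.entrySet()) {
            if (fileName.endsWith(e.getKey()) && e.getKey().length() > bestLen) {
                best = e.getValue();
                bestLen = e.getKey().length();
            }
        }
        return best;
    }
}
```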
Re: CompressionCodec in MapReduce
Posted by Grzegorz Gunia <sa...@student.agh.edu.pl>.
Thanks for your reply! That clears some things up.
There is but one problem... My CompressionCodec has to be instantiated
on a per-file basis, meaning it needs to know the name of the file it is
to compress/decompress. I'm guessing that would not be possible with the
current implementation?
Or if it is, how would I proceed with injecting the file name into it?
--
Greg
Re: CompressionCodec in MapReduce
Posted by Zizon Qiu <zz...@gmail.com>.
Append your custom codec's full class name to "io.compression.codecs", either
in mapred-site.xml or in the Configuration object passed to the Job constructor.
The MapReduce framework will try to guess the compression algorithm from the
input files' suffix.
If the getDefaultExtension() of any CompressionCodec registered in the
configuration matches the suffix, Hadoop will try to instantiate that codec
and, if it succeeds, decompress for you automatically.
The default value for "io.compression.codecs" is
"org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
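For illustration, that registration can also be done from code rather than in mapred-site.xml; com.example.MyCodec is a hypothetical stand-in for your codec class, and the job name is arbitrary:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Read the current codec list, falling back to the stock defaults:
String codecs = conf.get("io.compression.codecs",
    "org.apache.hadoop.io.compress.DefaultCodec,"
    + "org.apache.hadoop.io.compress.GzipCodec,"
    + "org.apache.hadoop.io.compress.BZip2Codec");
// Append rather than replace, so the stock codecs keep working:
conf.set("io.compression.codecs", codecs + ",com.example.MyCodec");
Job job = Job.getInstance(conf, "job-with-custom-codec");
```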