Posted to common-user@hadoop.apache.org by Raymond Jennings III <ra...@yahoo.com> on 2010/06/24 17:26:20 UTC

Newbie to HDFS compression

Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?

Thanks,
Ray


      

Re: Newbie to HDFS compression

Posted by Harsh J <qw...@gmail.com>.
On Fri, Jun 25, 2010 at 8:20 AM, James Seigel <ja...@tynt.com> wrote:
> Oops.  Replied to the wrong email.
>
> Well, I should add something useful to the conversation now.
>
> I think LZO has all the right features.  However, it does not have great support in Pig, if that is what you are using.
There's elephant-bird from Twitter that provides Pig extensions for
LZO store/load operations! It *almost* works out of the box (err, git
repository) :)
>
> It is good to have something splittable.  LZO - check.
>
> Compressing intermediate files... this is a no-brainer.
>
> Stick with it... it is a bit complicated to install.
>
> Cheers
> J
>
> On 2010-06-24, at 8:45 PM, James Seigel wrote:
>
>> Cool.  Maybe we should start a page.
>>
>> J
>> On 2010-06-24, at 8:16 PM, Harsh J wrote:
>>
>>> On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
>>> <ra...@yahoo.com> wrote:
>>>> Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.
>>>>
>>> LZO compression is the one you probably read about. The other
>>> available CompressionCodecs are BZip2 and GZip, and you should be able
>>> to use those just fine.
>>>
>>> Something like FileOutputFormat.setCompressOutput(conf, true);
>>>
>>> (Also look at the mapred.output.compress configuration variable for
>>> job-output compression, and mapred.compress.map.output for
>>> compressing intermediate map outputs)
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Eric Sammer <es...@cloudera.com>
>>>> To: common-user@hadoop.apache.org
>>>> Sent: Thu, June 24, 2010 5:09:33 PM
>>>> Subject: Re: Newbie to HDFS compression
>>>>
>>>> There is no filesystem-level compression in HDFS. You can store
>>>> compressed files in HDFS, however.
>>>>
>>>> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
>>>> <ra...@yahoo.com> wrote:
>>>>> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>>>>>
>>>>> Thanks,
>>>>> Ray
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eric Sammer
>>>> twitter: esammer
>>>> data: www.cloudera.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>
>
>



-- 
Harsh J
www.harshj.com

Re: Newbie to HDFS compression

Posted by James Seigel <ja...@tynt.com>.
Oops.  Replied to the wrong email.

Well, I should add something useful to the conversation now.

I think LZO has all the right features.  However, it does not have great support in Pig, if that is what you are using.

It is good to have something splittable.  LZO - check.

Compressing intermediate files... this is a no-brainer.

Stick with it... it is a bit complicated to install.

Cheers
J

On 2010-06-24, at 8:45 PM, James Seigel wrote:

> Cool.  Maybe we should start a page.
> 
> J
> On 2010-06-24, at 8:16 PM, Harsh J wrote:
> 
>> On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
>> <ra...@yahoo.com> wrote:
>>> Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.
>>> 
>> LZO compression is the one you probably read about. The other
>> available CompressionCodecs are BZip2 and GZip, and you should be able
>> to use those just fine.
>> 
>> Something like FileOutputFormat.setCompressOutput(conf, true);
>> 
>> (Also look at the mapred.output.compress configuration variable for
>> job-output compression, and mapred.compress.map.output for
>> compressing intermediate map outputs)
>>> 
>>> 
>>> ----- Original Message ----
>>> From: Eric Sammer <es...@cloudera.com>
>>> To: common-user@hadoop.apache.org
>>> Sent: Thu, June 24, 2010 5:09:33 PM
>>> Subject: Re: Newbie to HDFS compression
>>> 
>>> There is no filesystem-level compression in HDFS. You can store
>>> compressed files in HDFS, however.
>>> 
>>> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
>>> <ra...@yahoo.com> wrote:
>>>> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>>>> 
>>>> Thanks,
>>>> Ray
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Eric Sammer
>>> twitter: esammer
>>> data: www.cloudera.com
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Harsh J
>> www.harshj.com
> 


Re: Newbie to HDFS compression

Posted by James Seigel <ja...@tynt.com>.
Cool.  Maybe we should start a page.

J
On 2010-06-24, at 8:16 PM, Harsh J wrote:

> On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
> <ra...@yahoo.com> wrote:
>> Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.
>> 
> LZO compression is the one you probably read about. The other
> available CompressionCodecs are BZip2 and GZip, and you should be able
> to use those just fine.
> 
> Something like FileOutputFormat.setCompressOutput(conf, true);
> 
> (Also look at the mapred.output.compress configuration variable for
> job-output compression, and mapred.compress.map.output for
> compressing intermediate map outputs)
>> 
>> 
>> ----- Original Message ----
>> From: Eric Sammer <es...@cloudera.com>
>> To: common-user@hadoop.apache.org
>> Sent: Thu, June 24, 2010 5:09:33 PM
>> Subject: Re: Newbie to HDFS compression
>> 
>> There is no filesystem-level compression in HDFS. You can store
>> compressed files in HDFS, however.
>> 
>> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
>> <ra...@yahoo.com> wrote:
>>> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>>> 
>>> Thanks,
>>> Ray
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Harsh J
> www.harshj.com


Re: Newbie to HDFS compression

Posted by Harsh J <qw...@gmail.com>.
On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.
>
LZO compression is the one you probably read about. The other
available CompressionCodecs are BZip2 and GZip, and you should be able
to use those just fine.

Something like FileOutputFormat.setCompressOutput(conf, true);

(Also look at the mapred.output.compress configuration variable for
job-output compression, and mapred.compress.map.output for
compressing intermediate map outputs)
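To make the distinction between the two knobs concrete, here is a minimal sketch of the corresponding mapred-site.xml entries (property names are from the 0.20-era API; the gzip codec choice is just an example):

```xml
<!-- Compress the final job output written to HDFS -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

<!-- Compress the intermediate map outputs shipped to reducers -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```

The same settings can also be made per-job from code, e.g. conf.setBoolean("mapred.compress.map.output", true).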
>
>
> ----- Original Message ----
> From: Eric Sammer <es...@cloudera.com>
> To: common-user@hadoop.apache.org
> Sent: Thu, June 24, 2010 5:09:33 PM
> Subject: Re: Newbie to HDFS compression
>
> There is no filesystem-level compression in HDFS. You can store
> compressed files in HDFS, however.
>
> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
> <ra...@yahoo.com> wrote:
>> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>>
>> Thanks,
>> Ray
>>
>>
>>
>>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>
>
>
>
>



-- 
Harsh J
www.harshj.com

Re: Newbie to HDFS compression

Posted by Josh Patterson <jo...@cloudera.com>.
Raymond,

LZO installation can be daunting, even with the more recent
developments out there.

Most of this information is up at:

http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ

My quick guide: Installation for RedHat / Centos

- Watch out for the various RPMs needed for lzo/2/devel support.
- Get the native libs into the hadoop/lib subdir from:
  http://code.google.com/p/hadoop-gpl-compression/
- Double-check the permissions on these files; typically "rw-rw-r--"
  (664) works well. Also check the owner.
- Get Ant 1.8 to build the git repository if you are building any of the source.
- Move the lzo.jar into the hadoop/lib subdir.


Changes to config: mapred-site.xml (add the following entries)

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.child.env</name>
    <value>JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native</value>
  </property>

  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>


Changes to Config: core-site.xml

Add these entries:

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>



hadoop-env.sh

export HADOOP_CLASSPATH=/usr/lib/hadoop/lib/hadoop-lzo-0.4.3.jar
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-i386-32
# (or the 64-bit version of this native directory)

Usage

For the older (deprecated, later un-deprecated) API, to use LZO files as input to an MR job:

conf.setInputFormat(DeprecatedLzoTextInputFormat.class);

Use "lzop" to compress the file

http://www.lzop.org/

To index the file for splitting on input:

In-process, locally:

hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo

On the cluster, in MR:

hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs/dir/big_file.lzo

To compress the output of the entire job so that the output file in
HDFS is an LZO-compressed file:

TextOutputFormat.setCompressOutput(conf, true);
TextOutputFormat.setOutputCompressorClass(conf,
    com.hadoop.compression.lzo.LzopCodec.class);
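Pulling the pieces above together, a minimal old-API driver might look like the following sketch. The class and property names are taken from this thread (the DeprecatedLzoTextInputFormat package is assumed to be com.hadoop.mapred, as in the hadoop-lzo git repository); the job name and HDFS paths are hypothetical, and this assumes the lzo jar and native libs are installed as described. It needs a configured cluster to actually run:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

import com.hadoop.compression.lzo.LzopCodec;
import com.hadoop.mapred.DeprecatedLzoTextInputFormat;

public class LzoJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LzoJobDriver.class);
    conf.setJobName("lzo-example");

    // Read LZO-compressed (and indexed, hence splittable) input.
    conf.setInputFormat(DeprecatedLzoTextInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path("/hdfs/dir/big_file.lzo"));

    // Compress intermediate map outputs.
    conf.setBoolean("mapred.compress.map.output", true);

    // Write the final output as .lzo files.
    conf.setOutputFormat(TextOutputFormat.class);
    TextOutputFormat.setCompressOutput(conf, true);
    TextOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);
    FileOutputFormat.setOutputPath(conf, new Path("/hdfs/dir/out"));

    // Mapper/Reducer classes omitted; identity map/reduce runs by default.
    JobClient.runJob(conf);
  }
}
```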


Josh Patterson

Solutions Architect
Cloudera

On Thu, Jun 24, 2010 at 5:12 PM, Raymond Jennings III
<ra...@yahoo.com> wrote:
>
> Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.
>
>
>
> ----- Original Message ----
> From: Eric Sammer <es...@cloudera.com>
> To: common-user@hadoop.apache.org
> Sent: Thu, June 24, 2010 5:09:33 PM
> Subject: Re: Newbie to HDFS compression
>
> There is no filesystem-level compression in HDFS. You can store
> compressed files in HDFS, however.
>
> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
> <ra...@yahoo.com> wrote:
> > Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
> >
> > Thanks,
> > Ray
> >
> >
> >
> >
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>
>
>
>

Re: Newbie to HDFS compression

Posted by Raymond Jennings III <ra...@yahoo.com>.
Oh, maybe that's what I meant :-)  I recall reading something on this mailing list that "the compression" is not included with the Hadoop binary and that you have to get and install it separately due to license incompatibilities.  Looking at the config XML files, it's not clear what I need to do.  Thanks.



----- Original Message ----
From: Eric Sammer <es...@cloudera.com>
To: common-user@hadoop.apache.org
Sent: Thu, June 24, 2010 5:09:33 PM
Subject: Re: Newbie to HDFS compression

There is no filesystem-level compression in HDFS. You can store
compressed files in HDFS, however.

On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>
> Thanks,
> Ray
>
>
>
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com



      

Re: Newbie to HDFS compression

Posted by Eric Sammer <es...@cloudera.com>.
There is no filesystem-level compression in HDFS. You can store
compressed files in HDFS, however.
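Since HDFS stores whatever bytes you give it, a common pattern is to compress a file locally and then upload the archive as-is. A minimal sketch (the HDFS destination path is hypothetical, and the final step naturally requires a configured cluster):

```shell
# Compress a local file with gzip; HDFS will store the .gz as an ordinary file.
echo "2010-06-24 some log line" > events.log
gzip -f events.log              # produces events.log.gz, removes events.log
gunzip -t events.log.gz         # sanity-check the archive is intact
# On a real cluster you would then upload it, e.g.:
#   hadoop fs -put events.log.gz /user/ray/events.log.gz
# Note: a plain .gz file is not splittable, so a single map task reads it all.
```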

On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
<ra...@yahoo.com> wrote:
> Are there instructions on how to enable compression (and which type?) on HDFS?  Does this have to be done during installation, or can it be added to a running cluster?
>
> Thanks,
> Ray
>
>
>
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com