Posted to common-user@hadoop.apache.org by stan lee <le...@gmail.com> on 2010/05/18 17:44:59 UTC

Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Hi Guys,

I am trying to use compression to reduce the I/O workload when running a
job, but have not succeeded. I have several questions which need your help.

For LZO compression, I found a guide at
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. Why does it say
"Note that you must have both 32-bit and 64-bit liblzo2 installed"? I am
not sure whether that means we need 32-bit liblzo2 installed even on a
64-bit system. If so, why?

Also, if I don't use LZO compression and instead try to use gzip to
compress the final reduce output, I set the value below in mapred-site.xml,
but it doesn't seem to work (how can I find the final compressed .gz file?
I ran "hadoop dfs -l <dir>" and didn't see one). My question: can we use
gzip to compress the final result when it's not a streaming job? How can we
ensure that compression has been enabled during job execution?

<property>
       <name>mapred.output.compress</name>
       <value>true</value>
</property>

Thanks!
Stan Lee

Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by Ranjit Mathew <ra...@yahoo-inc.com>.
On Wednesday 19 May 2010 04:08 PM, stan lee wrote:
> So what's the meaning of "development package" here? I know for LZO there
> is the hadoop-lzo package, but what is it for gzip? I think gzip is
> written in C and shouldn't be built with ant/ivy?

Note that the wiki asks for development packages for zlib and lzo, not
gzip. These development packages contain the headers, etc. that are
needed to create programmes that use their respective APIs.

For example, on Fedora 12, you can install these simply by executing:

   sudo yum install zlib-devel lzo-devel
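
As background, gzip is not a separate algorithm: the gzip file format is
zlib's DEFLATE output wrapped in a small header and trailer, which is why
only a zlib development package is needed. The JDK even bundles its own
DEFLATE implementation in java.util.zip; a minimal, self-contained sketch
(plain Java, no Hadoop or native libraries involved):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateRoundTrip {
    // Round-trips text through DEFLATE, the algorithm zlib implements
    // and the one inside every .gz file.
    public static String roundTrip(String text) throws Exception {
        byte[] input = text.getBytes("UTF-8");

        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] compressed = new byte[input.length + 64]; // tiny inputs may expand
        int clen = deflater.deflate(compressed);
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, clen);
        byte[] restored = new byte[input.length];
        int rlen = inflater.inflate(restored);
        inflater.end();

        return new String(restored, 0, rlen, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello, zlib")); // prints "hello, zlib"
    }
}
```

The zlib-devel headers play the same role for native C code that these
JDK classes play in this sketch.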

Ranjit

Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by stan lee <le...@gmail.com>.
Thanks Harsh, Hong and Ted. I still have a question about how to build the
native library for the gzip compression type. I found the information below
on the wiki:
******************************************************************************
In particular the various packages you would need on the target platform
are:

   - C compiler (e.g. GNU C Compiler <http://gcc.gnu.org/>)
   - GNU Autotools chain: autoconf <http://www.gnu.org/software/autoconf/>,
     automake <http://www.gnu.org/software/automake/>,
     libtool <http://www.gnu.org/software/libtool/>
   - zlib development package (stable version >= 1.2.0)
   - lzo development package (stable version >= 2.0)

Once you have the prerequisites, use the standard build.xml and pass along
the compile.native flag (set to true) to build the native Hadoop library:

$ ant -Dcompile.native=true <target>
 ***************************************************************************
So what's the meaning of "development package" here? I know for LZO there
is the hadoop-lzo package, but what is it for gzip? I think gzip is written
in C and shouldn't be built with ant/ivy? Sorry, I am just a beginner
knocking on the door of Hadoop. Thanks in advance for your answers!

I have run "make install" on the GNU gzip source code on my cluster node;
would it work if I directly copied the generated libraries to the directory
$HADOOP_HOME/lib/native/Linux-amd64-64?

Stan.Lee

Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by stan lee <le...@gmail.com>.
Got it now. Since the sort example uses SequenceFileOutputFormat to write
its output, using gzip as the compression type requires the native
library... will try that.
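
For SequenceFile output specifically, there is also a compression
granularity to pick; a hedged sketch of the old-style property (the BLOCK
value here is an illustrative choice, not something prescribed in this
thread):

<property>
       <name>mapred.output.compression.type</name>
       <!-- NONE, RECORD, or BLOCK; BLOCK generally compresses best -->
       <value>BLOCK</value>
</property>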



Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by stan lee <le...@gmail.com>.
Thanks, all. So if we don't call the setCompressOutput() and
setOutputCompressorClass() functions in the sort program, and instead just
set mapred.output.compress to true and mapred.output.compression.codec to
org.apache.hadoop.io.compress.GzipCodec, we won't get a compressed output
file like part-xxxx.gz?


Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by Harsh J <qw...@gmail.com>.
Hi stan,

You can do something of this sort if you use FileOutputFormat, from
within your Job Driver:

    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // GzipCodec from org.apache.hadoop.io.compress.
    // and where 'job' is either JobConf or Job object.

This will write the simple file output in Gzip format. You also have BZip2Codec.
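
The same effect is also reachable purely through configuration, with no
driver changes; a sketch of the mapred-site.xml entries (property names as
they appear elsewhere in this thread):

<property>
       <name>mapred.output.compress</name>
       <value>true</value>
</property>
<property>
       <name>mapred.output.compression.codec</name>
       <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>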




-- 
Harsh J
www.harshj.com

Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by Hong Tang <ht...@yahoo-inc.com>.
Stan,

See my comments inline.

Thanks, Hong

On May 18, 2010, at 8:44 AM, stan lee wrote:

> Hi Guys,
>
> I am trying to use compression to reduce the I/O workload when running a
> job, but have not succeeded. I have several questions which need your help.
>
> For LZO compression, I found a guide at
> http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. Why does it say
> "Note that you must have both 32-bit and 64-bit liblzo2 installed"? I am
> not sure whether that means we need 32-bit liblzo2 installed even on a
> 64-bit system. If so, why?

The answer on the wiki page addresses the question of how to set up the
native libraries so that both 32-bit AND 64-bit Java will work. If you
stick to an environment with the same flavor of Java across the whole
cluster, then that solution does not apply to you.

> Also, if I don't use LZO compression and instead try to use gzip to
> compress the final reduce output, I set the value below in
> mapred-site.xml, but it doesn't seem to work (how can I find the final
> compressed .gz file? I ran "hadoop dfs -l <dir>" and didn't see one). My
> question: can we use gzip to compress the final result when it's not a
> streaming job? How can we ensure that compression has been enabled during
> job execution?
>
> <property>
>       <name>mapred.output.compress</name>
>       <value>true</value>
> </property>
>

The truth is, this option is honored by the implementations of the
OutputFormat classes. If you use TextOutputFormat, then you should see
files like "part-xxxx.gz" in the output directory. If you write your own
output format class, then you should follow the implementations of
TextOutputFormat or SequenceFileOutputFormat to set up compression
properly.
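
A direct way to check that compression was really applied is to look for
the gzip magic bytes 0x1f 0x8b at the start of a part file. A
self-contained sketch in plain Java (java.util.zip only, no Hadoop; the
in-memory buffer below stands in for a .gz part file):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMagicCheck {
    // True if the buffer starts with the gzip magic bytes 0x1f 0x8b.
    public static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws Exception {
        // Write a gzip stream the same way a compressed part file is written.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("some reduce output".getBytes("UTF-8"));
        }
        System.out.println(looksLikeGzip(buf.toByteArray())); // prints "true"
    }
}
```

The same two-byte check can be run against the first bytes of a part file
fetched from HDFS.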


Re: Do we need to install both 32 and 64 bit lzo2 to enable lzo compression and how can we use gzip compression codec in hadoop

Posted by Ted Yu <yu...@gmail.com>.
32-bit liblzo2 isn't needed on 64-bit systems.
