Posted to user@hadoop.apache.org by Geelong Yao <ge...@gmail.com> on 2013/06/20 08:29:33 UTC

some idea about the Data Compression

Hi, everyone,

I am working on data compression in two directions:
1. compressing raw data before it is uploaded into HDFS;
2. compressing data during processing in Hadoop, to reduce the pressure on I/O.


Can anyone give me some ideas on these two directions?


BRs
Geelong

-- 
From Good To Great

RE: some idea about the Data Compression

Posted by John Lilley <jo...@redpoint.net>.
Geelong,


1.       These files will probably be in some standard format such as .gz, .bz2, or .zip. In that case, pick an appropriate InputFormat; see e.g. http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/, http://stackoverflow.com/questions/14497572/reading-gzipped-file-in-hadoop-using-custom-recordreader, and the first sketch after this list.

2.       Generally, compression is a Good Thing and will improve performance, but only if you use a fast compressor like LZO or Snappy; gzip, ZIP, bzip2, etc. are too slow for this. You also need to ensure that your compressed files are "splittable" if you are going to create a single file that will be processed by a later MR stage; a SequenceFile is helpful for this. For typical intermediate outputs it doesn't matter as much, because you will have a folder of file parts that are "pre-split" in some sense. Once upon a time LZO compression was something you had to install as a separate component, but I think the modern distros include it. The second sketch after this list shows one way to wire this up. See for example: http://kickstarthadoop.blogspot.com/2012/02/use-compression-with-mapreduce.html, http://blog.cloudera.com/blog/2009/05/10-mapreduce-tips/, http://my.safaribooksonline.com/book/software-engineering-and-development/9781449328917/compression/id3689058, https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression (section 4.2 in the Elephant book).
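
For point 1, a minimal sketch, assuming line-oriented text compressed with gzip and the Hadoop 2.x (mapreduce.*) API; the class name and the paths are hypothetical. TextInputFormat consults the CompressionCodecFactory based on the file extension, so .gz and .bz2 files are decompressed transparently; note that a .gz file is not splittable, so each one goes to a single mapper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedInputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read gzipped text");
    job.setJarByClass(CompressedInputDriver.class);

    // TextInputFormat recognizes the .gz extension and decompresses on the
    // fly; no custom RecordReader is needed for line-oriented text.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/raw"));      // hypothetical path
    FileOutputFormat.setOutputPath(job, new Path("/data/lines"));  // hypothetical path

    // No mapper or reducer is set: the default identity Mapper copies the
    // (offset, line) pairs through, which is enough to verify decompression.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);  // map-only smoke test

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}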
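
For point 2, a sketch of turning on Snappy, again hedged: the property names are the Hadoop 2.x (mapreduce.*) ones, the class name and paths are hypothetical, and SnappyCodec only works if your distro ships the native Snappy library. It compresses the intermediate map output (the shuffle) and writes the final output as a block-compressed SequenceFile, which stays splittable for a later MR stage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappyOutputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress the intermediate map output to cut shuffle I/O.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "snappy sequencefile output");
    job.setJarByClass(SnappyOutputDriver.class);
    FileInputFormat.addInputPath(job, new Path("/data/lines"));        // hypothetical path
    FileOutputFormat.setOutputPath(job, new Path("/data/compressed")); // hypothetical path

    // Final output: block-compressed SequenceFile. BLOCK compression groups
    // many records per compression block and keeps the file splittable.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(
        job, SequenceFile.CompressionType.BLOCK);

    // Identity mapper/reducer pass (offset, line) pairs through the shuffle.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}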

John

From: Geelong Yao [mailto:geelongyao@gmail.com]
Sent: Thursday, June 20, 2013 12:30 AM
To: user@hadoop.apache.org
Subject: some idea about the Data Compression

Hi, everyone,

I am working on data compression in two directions:
1. compressing raw data before it is uploaded into HDFS;
2. compressing data during processing in Hadoop, to reduce the pressure on I/O.


Can anyone give me some ideas on these two directions?


BRs
Geelong

--
From Good To Great
