Posted to user@pig.apache.org by rakesh sharma <ra...@hotmail.com> on 2012/05/23 20:15:04 UTC

Recommendations for compression

Hi Guys,
I am writing data into Hadoop using a Java client. The source of the data is a messaging system. The Java client rotates files every 15 minutes, and I use PigServer to submit a MapReduce job on each just-closed file. These files contain plain text and are very large. I am not using any compression currently, but would like to explore it, as the amount of data is increasing day by day.
I need to compress the data while writing it to Hadoop, and to make Pig aware of this compression when submitting MapReduce jobs. I am looking for some guidance to understand my options.
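For reference, here is roughly how I drive PigServer today (a simplified sketch, with made-up paths and a placeholder script, not my actual production code):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class SubmitPigJob {
        public static void main(String[] args) throws Exception {
            // Submit a MapReduce job over the just-closed file (path is made up)
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("raw = LOAD '/data/2012-05-23/1400.txt' USING PigStorage('\\t');");
            pig.registerQuery("grp = GROUP raw ALL;");
            pig.registerQuery("counts = FOREACH grp GENERATE COUNT(raw);");
            pig.store("counts", "/output/2012-05-23/1400"); // this kicks off the job
        }
    }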
Thanks,
Rakesh

RE: Recommendations for compression

Posted by rakesh sharma <ra...@hotmail.com>.
Hi Prashant,
Thanks so much for such a quick response. Based on your analysis, I am leaning towards LZO. I have a very basic question: I am using org.apache.hadoop.fs.FSDataOutputStream to write data to Hadoop incrementally (i.e. appending data as it arrives from the messaging system). I know I can gzip it using something like:

    FSDataOutputStream out = fs.create(file);
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write("sss".getBytes("UTF-8"));

Is there something similar available in the LZO library?
Thanks,
Rakesh
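A minimal sketch of what the LZO equivalent could look like, using Hadoop's generic CompressionCodec interface with the codec class from the hadoop-lzo project linked below (class name taken from that project; untested):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.util.ReflectionUtils;

    public class LzoWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Instantiate the LZO codec by name; requires the hadoop-lzo jar
            // and its native libraries to be available at runtime
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                    conf.getClassByName("com.hadoop.compression.lzo.LzopCodec"), conf);

            FSDataOutputStream out = fs.create(new Path("/data/messages.lzo")); // made-up path
            CompressionOutputStream lzo = codec.createOutputStream(out); // wraps out, like GZIPOutputStream
            lzo.write("sss".getBytes("UTF-8"));
            lzo.close(); // flushes and closes the underlying HDFS stream too
        }
    }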
> Date: Wed, 23 May 2012 11:28:44 -0700
> Subject: Re: Recommendations for compression
> From: prash1784@gmail.com
> To: user@pig.apache.org
> 
> Hi Rakesh,
> 
> You have quite a few options, depending on the space-time tradeoff you want
> to make.
> 
> Gzip compresses well but is CPU intensive - not splittable, so parallelism
> and network IO suffer.
> 
> Snappy is not space efficient but is easy on CPU (great for map output
> compression) - not splittable unless you use it within a container format
> like SequenceFile.
> 
> LZO strikes a good space-time balance and is used by several companies
> operating Hadoop (LZO is splittable and fast, which is a major advantage):
> https://github.com/kevinweil/hadoop-lzo
> 
> Bzip2 compresses well and is splittable, but is CPU intensive.
> 
> Based on your requirements, you could go with one of these. Makes sense?
> 
> 
> On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <rakesh_sharma66@hotmail.com> wrote:
> 
> >
> > Hi Guys,
> > I am writing data into Hadoop using a Java client. The source of the data
> > is a messaging system. The Java client rotates files every 15 minutes, and
> > I use PigServer to submit a MapReduce job on each just-closed file. These
> > files contain plain text and are very large. I am not using any
> > compression currently, but would like to explore it, as the amount of data
> > is increasing day by day.
> > I need to compress the data while writing it to Hadoop, and to make Pig
> > aware of this compression when submitting MapReduce jobs. I am looking for
> > some guidance to understand my options.
> > Thanks,
> > Rakesh

Re: Recommendations for compression

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Rakesh,

You have quite a few options, depending on the space-time tradeoff you want to make.

Gzip compresses well but is CPU intensive - not splittable, so parallelism and network IO suffer.

Snappy is not space efficient but is easy on CPU (great for map output compression) - not splittable unless you use it within a container format like SequenceFile.
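For instance, a block-compressed SequenceFile writer with Snappy looks roughly like this (a sketch; it assumes the native Snappy libraries are installed on your nodes, and the path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappySequenceFileSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // BLOCK compression batches many records per compressed chunk,
            // which is what keeps the file splittable
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/data/messages.seq"), // made-up path
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new SnappyCodec());

            writer.append(new LongWritable(1L), new Text("one message line"));
            writer.close();
        }
    }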

LZO strikes a good space-time balance and is used by several companies operating Hadoop (LZO is splittable and fast, which is a major advantage): https://github.com/kevinweil/hadoop-lzo
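One caveat there: an .lzo file only becomes splittable once you build an index for it. The hadoop-lzo project ships an indexer for that; roughly (a sketch, with the class name taken from that project and a made-up path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import com.hadoop.compression.lzo.LzoIndexer;

    public class IndexLzoSketch {
        public static void main(String[] args) throws Exception {
            // Writes /data/messages.lzo.index next to the file so that
            // MapReduce can split the .lzo file at compressed-block boundaries
            LzoIndexer indexer = new LzoIndexer(new Configuration());
            indexer.index(new Path("/data/messages.lzo")); // made-up path
        }
    }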

Bzip2 compresses well and is splittable, but is CPU intensive.

Based on your requirements, you could go with one of these. Makes sense?
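On the second part of your question, making Pig aware of the compression: Pig's built-in loaders such as PigStorage decompress .gz and .bz2 inputs transparently based on the file extension, so those LOAD statements need no change. Map output compression is plain configuration, which you can pass to PigServer as properties, roughly like this (a sketch using Hadoop 1.x property names and a made-up path):

    import java.util.Properties;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class CompressedPigSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Compress intermediate map output with Snappy (Hadoop 1.x names)
            props.setProperty("mapred.compress.map.output", "true");
            props.setProperty("mapred.map.output.compression.codec",
                    "org.apache.hadoop.io.compress.SnappyCodec");

            PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
            // The .gz extension alone is enough for PigStorage to decompress the input
            pig.registerQuery("raw = LOAD '/data/2012-05-23/1400.gz' USING PigStorage();");
            pig.store("raw", "/output/2012-05-23/1400");
        }
    }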


On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <rakesh_sharma66@hotmail.com> wrote:

>
> Hi Guys,
> I am writing data into Hadoop using a Java client. The source of the data
> is a messaging system. The Java client rotates files every 15 minutes, and
> I use PigServer to submit a MapReduce job on each just-closed file. These
> files contain plain text and are very large. I am not using any compression
> currently, but would like to explore it, as the amount of data is
> increasing day by day.
> I need to compress the data while writing it to Hadoop, and to make Pig
> aware of this compression when submitting MapReduce jobs. I am looking for
> some guidance to understand my options.
> Thanks,
> Rakesh