Posted to common-user@hadoop.apache.org by Johan Oskarsson <jo...@oskarsson.nu> on 2009/03/03 09:32:20 UTC

Splittable lzo files

Hi,

thought I'd pass on this blog post I just wrote about how we compress 
our raw log data in Hadoop using Lzo at Last.fm.

The essence of the post is that we're able to make the files splittable by 
indexing where each compressed chunk starts in the file, similar to the 
gzip input format being worked on.
This actually gives us a performance boost in certain jobs that read a 
lot of data, while saving us disk space at the same time.
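
To make the idea concrete, here's a toy sketch of the indexing trick in 
Python. It uses zlib as a stand-in for LZO (an assumption for the sake of 
a self-contained example; the post itself is about LZO-compressed chunks, 
but the principle is the same): compress the data as independent chunks, 
record the byte offset where each chunk starts, and a reader can then seek 
straight to any chunk boundary and decompress from there, which is what 
lets Hadoop split the file across map tasks.

```python
import io
import zlib

def write_indexed(records, chunk_size=2):
    """Compress records in independent chunks; return (blob, list of chunk offsets)."""
    out = io.BytesIO()
    index = []
    for i in range(0, len(records), chunk_size):
        index.append(out.tell())          # byte offset where this chunk starts
        chunk = "\n".join(records[i:i + chunk_size]).encode()
        out.write(zlib.compress(chunk))   # each chunk is a self-contained stream
    return out.getvalue(), index

def read_chunk(blob, index, n):
    """Decompress chunk n without reading any earlier chunk."""
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(blob)
    return zlib.decompress(blob[start:end]).decode().split("\n")

records = ["log line %d" % i for i in range(6)]
blob, index = write_indexed(records)
print(read_chunk(blob, index, 1))   # reads chunk 1 alone -> ['log line 2', 'log line 3']
```

In the real setup the index would live in a small side file next to the 
.lzo file, and each map task would start reading at the nearest chunk 
boundary inside its split.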

http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html

/Johan

Re: Splittable lzo files

Posted by Johan Oskarsson <jo...@oskarsson.nu>.
We use it with python (dumbo) and streaming, so it should certainly be 
possible. I haven't tried it myself though, so can't give any pointers.
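
For the streaming case the job would just need to name an LZO-aware input 
format on the command line. A rough sketch, assuming the input format class 
from the hadoop-lzo project is available; the jar paths, class name, and 
file names here are assumptions for illustration, not something tested in 
this thread:

```shell
# Hypothetical streaming invocation: -libjars path, input format class,
# and mapper/reducer scripts are placeholders, not from this thread.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -libjars /path/to/hadoop-lzo.jar \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -input /logs/2009-03-03/*.lzo \
    -output /out/lzo-test \
    -mapper my_mapper.py \
    -reducer my_reducer.py
```

The .lzo files would need to be indexed first so that the input format can 
find the chunk boundaries; otherwise each file falls back to a single split.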

/Johan

Miles Osborne wrote:
> that's very interesting.  for us poor souls using streaming, would we
> be able to use it?
> 
> (right now i'm looking at a 100+ GB gzipped file ...)
> 
> Miles
> 
> 2009/3/3 Johan Oskarsson <jo...@oskarsson.nu>:
>> Hi,
>>
>> thought I'd pass on this blog post I just wrote about how we compress our
>> raw log data in Hadoop using Lzo at Last.fm.
>>
>> The essence of the post is that we're able to make them splittable by
>> indexing where each compressed chunk starts in the file, similar to the gzip
>> input format being worked on.
>> This actually gives us a performance boost in certain jobs that read a lot
>> of data while saving us disk space at the same time.
>>
>> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
>>
>> /Johan
>>
> 
> 
> 


Re: Splittable lzo files

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
that's very interesting.  for us poor souls using streaming, would we
be able to use it?

(right now i'm looking at a 100+ GB gzipped file ...)

Miles

2009/3/3 Johan Oskarsson <jo...@oskarsson.nu>:
> Hi,
>
> thought I'd pass on this blog post I just wrote about how we compress our
> raw log data in Hadoop using Lzo at Last.fm.
>
> The essence of the post is that we're able to make them splittable by
> indexing where each compressed chunk starts in the file, similar to the gzip
> input format being worked on.
> This actually gives us a performance boost in certain jobs that read a lot
> of data while saving us disk space at the same time.
>
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
>
> /Johan
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Splittable lzo files

Posted by tim robertson <ti...@gmail.com>.
Thanks for posting this Johan,

I tried unsuccessfully to handle gzip files for the reasons you state
and resorted to uncompressed input.  I will try the LZO format and post
the performance difference of compressed vs uncompressed on EC2, which
seems to have very slow disk I/O.  We have seen really bad import
speeds on PostGIS and MySQL with EC2 (worse than Mac minis, even with
the largest instances), so I think this might be very applicable to
EC2 users.

Cheers,

Tim




On Tue, Mar 3, 2009 at 9:32 AM, Johan Oskarsson <jo...@oskarsson.nu> wrote:
> Hi,
>
> thought I'd pass on this blog post I just wrote about how we compress our
> raw log data in Hadoop using Lzo at Last.fm.
>
> The essence of the post is that we're able to make them splittable by
> indexing where each compressed chunk starts in the file, similar to the gzip
> input format being worked on.
> This actually gives us a performance boost in certain jobs that read a lot
> of data while saving us disk space at the same time.
>
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
>
> /Johan
>