Posted to common-user@hadoop.apache.org by u235sentinel <u2...@gmail.com> on 2010/04/03 19:45:07 UTC

Does Hadoop compress files?

I'm starting to evaluate Hadoop.  We are currently running Sensage and 
store a lot of log files in our current environment.  I've been looking 
at the Hadoop forums and googling (of course) but haven't learned whether 
Hadoop HDFS applies any compression to the files we store.

On average we're storing about 600 gigs a week in log files (more or 
less).  Generally we need to store about 1 1/2 to 2 years of logs.  With 
Sensage compression we can store 200+ TB of logs in our current 
environment.

As I said, we're starting to evaluate whether Hadoop would be a good 
replacement for our Sensage environment (or could at least augment it).

Thanks a bunch!!

Re: Does Hadoop compress files?

Posted by Eric Sammer <es...@cloudera.com>.
See below.

On Sun, Apr 4, 2010 at 3:32 PM, u235sentinel <u2...@gmail.com> wrote:
> Ok that's what I was thinking.  I was wondering if Hadoop did on-the-fly
> compression as it stored files in HDFS, like Sensage does.  But it sounds
> like Hadoop will take a compressed file and store it as compressed, which
> is fine by me.  Sensage will do the same.

That's correct.

> I believe this answers the question.  Sonal's link suggests there is support
> for compression using zlib, gzip and bzip2.
> One more question though.  If we store files in compressed format, are there
> any issues with searching that data?  I'm curious if there is a disadvantage
> in doing this.  I could build bigger and badder servers but was hoping for
> compression.

Just to be super specific about this, you can write data in any format
into HDFS. If you can turn it into Java primitives (including bytes),
you can write it to HDFS. The second half of the question is: what are
my options for processing this data? If you plan on using Hadoop map
reduce to process these files, you'll want to make sure you use a
compression format that Hadoop can "split" for parallel processing,
and only a subset of the supported formats are splittable. If you
aren't planning on using the MR component of Hadoop, you can do
whatever you'd like. You can still write map reduce jobs against
non-splittable compression formats, but Hadoop will not be able to
process a single file concurrently; it will have to process each
entire file in one task. The best option here is to dig into the docs
a bit, figure out whether what you want to do will be possible, and
take care of these details at the beginning.
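
As a quick illustration, here is a rough sketch of how you might check
which files in HDFS use a splittable codec. It assumes a Hadoop version
that ships the SplittableCompressionCodec interface (not every release
has it; on older releases you'd have to special-case formats by hand),
and the class name here is just for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
      public static void main(String[] args) {
        CompressionCodecFactory factory =
            new CompressionCodecFactory(new Configuration());
        for (String arg : args) {
          // The factory maps file extensions (.gz, .bz2, ...) to codecs.
          CompressionCodec codec = factory.getCodec(new Path(arg));
          if (codec == null) {
            System.out.println(arg + ": no codec (plain files split fine)");
          } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println(arg + ": splittable");
          } else {
            System.out.println(arg + ": NOT splittable; one task per file");
          }
        }
      }
    }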

> Thanks
>
>
>
> Eric Sammer wrote:
>>
>> To clarify, there is no implicit compression in HDFS. In other words,
>> if you want your data to be compressed, you have to write it that way.
>> If you plan on writing map reduce jobs to process the compressed data,
>> you'll want to use a splittable compression format. This generally
>> means LZO or block compressed SequenceFiles which others have
>> mentioned.



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

Re: Does Hadoop compress files?

Posted by u235sentinel <u2...@gmail.com>.
Ok that's what I was thinking.  I was wondering if Hadoop did on-the-fly 
compression as it stored files in HDFS, like Sensage does.  But it sounds 
like Hadoop will take a compressed file and store it as compressed, which 
is fine by me.  Sensage will do the same.

I believe this answers the question.  Sonal's link suggests there is 
support for compression using zlib, gzip and bzip2. 
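
As a concrete example, enabling one of those codecs for job output is
only a couple of calls against the old mapred API. A minimal sketch,
assuming gzip and a JobConf you have already set up (the class name
here is just a placeholder):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedOutput {
      // Compress the final job output with gzip (zlib-based under the hood).
      public static void enableGzipOutput(JobConf conf) {
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
      }
    }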

One more question though.  If we store files in compressed format, are 
there any issues with searching that data?  I'm curious if there is a 
disadvantage in doing this.  I could build bigger and badder servers but 
was hoping for compression.

Thanks



Eric Sammer wrote:
> To clarify, there is no implicit compression in HDFS. In other words,
> if you want your data to be compressed, you have to write it that way.
> If you plan on writing map reduce jobs to process the compressed data,
> you'll want to use a splittable compression format. This generally
> means LZO or block compressed SequenceFiles which others have
> mentioned.


Re: Does Hadoop compress files?

Posted by Eric Sammer <es...@cloudera.com>.
To clarify, there is no implicit compression in HDFS. In other words,
if you want your data to be compressed, you have to write it that way.
If you plan on writing map reduce jobs to process the compressed data,
you'll want to use a splittable compression format. This generally
means LZO or block compressed SequenceFiles which others have
mentioned.
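
A minimal sketch of writing a block-compressed SequenceFile with the
SequenceFile.Writer API (the path, key, and value types here are just
placeholders for your log data):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SeqFileWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/logs/example.seq");  // placeholder path
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
          // e.g. key = line number or timestamp, value = raw log line
          writer.append(new LongWritable(1L), new Text("sample log line"));
        } finally {
          writer.close();
        }
      }
    }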

On Sat, Apr 3, 2010 at 10:45 AM, u235sentinel <u2...@gmail.com> wrote:
> I'm starting to evaluate Hadoop.  We are currently running Sensage and store
> a lot of log files in our current environment.  I've been looking at the
> Hadoop forums and googling (of course) but haven't learned whether Hadoop
> HDFS applies any compression to the files we store.
>
> On average we're storing about 600 gigs a week in log files (more or
> less).  Generally we need to store about 1 1/2 to 2 years of logs.  With
> Sensage compression we can store 200+ TB of logs in our current
> environment.
>
> As I said, we're starting to evaluate whether Hadoop would be a good
> replacement for our Sensage environment (or could at least augment it).
>
> Thanks a bunch!!
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

Re: Does Hadoop compress files?

Posted by Rajesh Balamohan <ra...@gmail.com>.
Hadoop has a facility to compress intermediate map output and job
output. Is your question about reading compressed files themselves into
Hadoop?

If so, refer to SequenceFileInputFormat (
http://developer.yahoo.com/hadoop/tutorial/module4.html )

The *SequenceFileInputFormat* reads special binary files that are specific
to Hadoop. These files include many features designed to allow data to be
rapidly read into Hadoop mappers. Sequence files are block-compressed and
provide direct serialization and deserialization of several arbitrary data
types (not just text). Sequence files can be generated as the output of
other MapReduce tasks and are an efficient intermediate representation for
data that is passing from one MapReduce job to another.
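
A rough sketch of wiring both of those up with the old mapred API:
sequence files as job input, plus compressed intermediate map output.
The wrapper class name is just for illustration; the Hadoop classes and
JobConf setters are the stock ones:

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class SeqInputConf {
      public static void configure(JobConf conf) {
        // Read (block-compressed) sequence files as job input.
        conf.setInputFormat(SequenceFileInputFormat.class);
        // Also compress the intermediate map output sent to reducers.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(DefaultCodec.class);
      }
    }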

On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel <u2...@gmail.com> wrote:

> I'm starting to evaluate Hadoop.  We are currently running Sensage and store
> a lot of log files in our current environment.  I've been looking at the
> Hadoop forums and googling (of course) but haven't learned whether Hadoop
> HDFS applies any compression to the files we store.
>
> On average we're storing about 600 gigs a week in log files (more or
> less).  Generally we need to store about 1 1/2 to 2 years of logs.  With
> Sensage compression we can store 200+ TB of logs in our current
> environment.
>
> As I said, we're starting to evaluate whether Hadoop would be a good
> replacement for our Sensage environment (or could at least augment it).
>
> Thanks a bunch!!
>



-- 
~Rajesh.B

Re: Does Hadoop compress files?

Posted by Sonal Goyal <so...@gmail.com>.
Hi,

Please check
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Data+Compression

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel <u2...@gmail.com> wrote:

> I'm starting to evaluate Hadoop.  We are currently running Sensage and store
> a lot of log files in our current environment.  I've been looking at the
> Hadoop forums and googling (of course) but haven't learned whether Hadoop
> HDFS applies any compression to the files we store.
>
> On average we're storing about 600 gigs a week in log files (more or
> less).  Generally we need to store about 1 1/2 to 2 years of logs.  With
> Sensage compression we can store 200+ TB of logs in our current
> environment.
>
> As I said, we're starting to evaluate whether Hadoop would be a good
> replacement for our Sensage environment (or could at least augment it).
>
> Thanks a bunch!!
>