Posted to user@hbase.apache.org by Adam Portley <a_...@hotmail.com> on 2010/11/20 16:14:03 UTC

hbase compression

Hi, 
I'm using a map-reduce job with HFileOutputFormat followed by bulk loads/merges to create and populate a table with multiple column families.  I would like to understand how compression works, and how to specify a non-default compression in this setup.  So:

AFAIK, there are two relevant switches: the per-column-family compression configuration and hfile.compression.  Are there any others? 
Can the compression format be deduced from the contents of an HFile, or does the format of a region store file have to match the family's configuration?
Can a column family's compression format be changed if it already contains some data?  If so, how is this done?  Are the family store files converted to the new format before the table comes back online, or is it a lazy-update, or just a compaction-time thing? 
Is it possible to write updates for multiple families with different compression formats in the same map-reduce job? 
Can HFileOutputFormat.configureIncrementalLoad infer compression format from an existing table, just as it does for partitioning?
Is there a way to specify a default compression which is not None, so that new tables and families are automatically compressed (with gzip for example)? 
I have seen archived discussions which refer to RECORD vs BLOCK compression, but I don't see those options in later versions.  Have they gone away? 

Thanks, 
--Adam

Re: hbase compression

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Adam,

Answers inline below:

On Sat, Nov 20, 2010 at 7:14 AM, Adam Portley <a_...@hotmail.com> wrote:

>
> Hi,
> I'm using a map-reduce job with HFileOutputFormat followed by bulk
> loads/merges to create and populate a table with multiple column families.  I
> would like to understand how compression works, and how to specify a
> non-default compression in this setup.  So:
>
> AFAIK, there are two relevant switches: the per-column-family compression
> configuration and hfile.compression.  Are there any others?
>

I believe those are the only ones.
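
For reference, a minimal sketch of the per-column-family switch, assuming the
Java client API of that era (HBaseAdmin, HColumnDescriptor, Compression.Algorithm);
the table and family names below are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateCompressedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Per-column-family compression lives in the table schema; the region
    // servers use it whenever they write store files for this family.
    HTableDescriptor table = new HTableDescriptor("mytable");
    HColumnDescriptor cf = new HColumnDescriptor("cf1");
    cf.setCompressionType(Compression.Algorithm.GZ);
    table.addFamily(cf);
    admin.createTable(table);
  }
}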


> Can the compression format be deduced from the contents of an HFile, or does
> the format of a region store file have to match the family's configuration?
>

The HFile header contains the information about the codec, so it doesn't
necessarily have to match.


> Can a column family's compression format be changed if it already contains
> some data?  If so, how is this done?  Are the family store files converted
> to the new format before the table comes back online, or is it a
> lazy-update, or just a compaction-time thing?
>

If the CF already has data, the existing data won't be automatically updated
with the new codec. As new hfiles are flushed, or compactions run, new data
will take the codec specified in the CF properties. So, over time, through
compactions, the codec will switch over. Of course you can force a major
compaction to switch it over at any point.
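
As a rough sketch of that disable/alter/enable/compact sequence, assuming the
HBaseAdmin calls available around this release (exact signatures vary a bit
between versions; table and family names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class SwitchFamilyCompression {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] tableName = Bytes.toBytes("mytable");

    // Change the family's compression in the schema; the table has to be
    // offline while the descriptor is modified.
    admin.disableTable(tableName);
    HTableDescriptor desc = admin.getTableDescriptor(tableName);
    HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("cf1"));
    cf.setCompressionType(Compression.Algorithm.GZ);
    admin.modifyTable(tableName, desc);
    admin.enableTable(tableName);

    // Existing store files keep their old codec until they are rewritten;
    // a major compaction rewrites them with the new codec right away.
    admin.majorCompact(tableName);
  }
}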


> Is it possible to write updates for multiple families with different
> compression formats in the same map-reduce job?
>

Currently, multi-column-family support for HFileOutputFormat is still in the works.
This does seem like a reasonable requirement, though - hopefully it will be
addressed.


> Can HFileOutputFormat.configureIncrementalLoad infer compression format
> from an existing table, just as it does for partitioning?
>

I don't think it does right now, but don't see a reason that it shouldn't.
Can you file a JIRA, and perhaps even put up a patch? It sounds like you
understand this stuff pretty well, would be great to have you contributing
to the project.
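
In the meantime you can set hfile.compression on the job configuration
yourself, since as far as I can tell HFileOutputFormat picks that setting up
when it writes the HFiles. A rough sketch (the table name and codec string
are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Tell HFileOutputFormat which codec to use for the HFiles it writes.
    conf.set("hfile.compression", "gz");

    Job job = new Job(conf, "bulk load prep");
    HTable table = new HTable(conf, "mytable");

    // Sets up the reducer, partitioner and output format for the bulk load;
    // you still configure your own mapper and input as usual.
    HFileOutputFormat.configureIncrementalLoad(job, table);
  }
}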


> Is there a way to specify a default compression which is not None, so that
> new tables and families are automatically compressed (with gzip for
> example)?
>

I'm not aware of one but there might be. Unfortunately I'm booted into
Windows (ugh) at the moment so don't have the source code handy ;-) Maybe
someone else can answer this.

> I have seen archived discussions which refer to RECORD vs BLOCK compression,
> but I don't see those options in later versions.  Have they gone away?
>
>
Yes, all compression now is block-based (like the old BLOCK setting). HBase
used to use the MapFile format from Hadoop which supported that option, but
in HBase 0.20.0 the new HFile format was developed which is always block
based.

Of course nothing stops you from storing compressed records if you see fit!
HBase does fine with binary values.
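
For example, a sketch of gzipping a value on the client before storing it
(the table, family and qualifier names are made up; any codec works, since
HBase just sees opaque bytes):

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressedValuePut {
  public static void main(String[] args) throws Exception {
    byte[] value = Bytes.toBytes("some large record worth compressing");

    // Compress the record on the client side before it ever reaches HBase.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(buf);
    gz.write(value);
    gz.close();

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), buf.toByteArray());
    table.put(put);
  }
}

The reader just has to gunzip the value on its side, of course.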

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera