Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2021/05/10 18:01:00 UTC

[jira] [Commented] (HBASE-25869) WAL value compression

    [ https://issues.apache.org/jira/browse/HBASE-25869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342054#comment-17342054 ] 

Andrew Kyle Purtell commented on HBASE-25869:
---------------------------------------------

*WAL Compression Results*

Site configuration:

{noformat}
<!-- retain all WALs  -->
<property>
   <name>hbase.master.logcleaner.ttl</name>
   <value>604800000</value>
</property>

<!-- to enable compression -->
<property>
    <name>hbase.regionserver.wal.enablecompression</name>
    <value>true</value>
</property>

<!-- to enable value compression -->
<property>
    <name>hbase.regionserver.wal.value.enablecompression</name>
    <value>true</value>
</property>
{noformat}

IntegrationTestLoadCommonCrawl
Input: s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz

|| Mode || WALs aggregate size || Difference ||
| Default | 5,006,963,824 | - |
| Compression enabled, value compression not enabled | 5,006,874,201 | (< 0.1%) |
| Compression enabled, value compression enabled | 940,657,251 | (81.2%) |

In this test case, WAL compression without value compression is not enough: it saves almost nothing. The schema is already optimized for space efficiency (column families and qualifiers are single characters), so there is only a bit of redundancy to reclaim, and the bulky values (web crawl results) dominate.
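
The Difference column can be rechecked from the aggregate sizes in the table. The following is a minimal sketch, assuming the column reports the size reduction relative to the Default row; the class and method names are illustrative only and are not part of HBase.

```java
import java.util.Locale;

final class WalSizeReduction {
    // Percentage by which `size` is smaller than `baseline`.
    static double reductionPercent(long baseline, long size) {
        return 100.0 * (baseline - size) / baseline;
    }

    public static void main(String[] args) {
        // Aggregate WAL sizes in bytes, copied from the table above.
        long baseline = 5_006_963_824L;   // Default
        long metadataOnly = 5_006_874_201L; // compression enabled, value compression not enabled
        long withValues = 940_657_251L;   // compression enabled, value compression enabled

        System.out.printf(Locale.ROOT, "without value compression: %.4f%%%n",
                reductionPercent(baseline, metadataOnly));
        System.out.printf(Locale.ROOT, "with value compression:    %.1f%%%n",
                reductionPercent(baseline, withValues));
    }
}
```

The second figure reproduces the 81.2% in the table; the first shows how little is saved when values are left uncompressed.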

> WAL value compression
> ---------------------
>
>                 Key: HBASE-25869
>                 URL: https://issues.apache.org/jira/browse/HBASE-25869
>             Project: HBase
>          Issue Type: Bug
>          Components: Operability, wal
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> WAL storage can be expensive, especially if the cell values represented in the edits are large, consisting of blobs or significant lengths of text. Such WALs might need to be kept around for a fairly long time to satisfy replication constraints on a space-limited (or space-contended) filesystem. 
> We have a custom dictionary compression scheme for cell metadata that is engaged when WAL compression is enabled in site configuration. This is fine for that application, where we can expect the universe of values (and their lengths) in the custom dictionaries to be constrained. For arbitrary values it is better to use Deflate compression, a complete LZ-class algorithm suitable for arbitrary, albeit compressible, data. It is reasonably fast (certainly fast enough for WALs), compresses well, and is universally available as part of the Java runtime. 
> With a trick that encodes whether or not the cell value is compressed in the high-order bit of the type byte, this can be done in a backwards-compatible manner. 
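
The two ideas in the description can be illustrated with the JDK alone. This is a rough sketch, not the HBase implementation: it round-trips a value through java.util.zip Deflater and Inflater, and it marks a type byte as "value compressed" by setting its high-order bit. The class name, the helper methods, the flag constant, and the example type code 4 are all hypothetical, and the sketch assumes every real type code fits in the low 7 bits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ValueCompressionSketch {
    // Hypothetical flag: high-order bit of the type byte marks a compressed value.
    static final int COMPRESSED_FLAG = 0x80;

    static byte markCompressed(byte type) { return (byte) (type | COMPRESSED_FLAG); }
    static boolean isCompressed(byte type) { return (type & COMPRESSED_FLAG) != 0; }
    static byte originalType(byte type) { return (byte) (type & 0x7F); }

    // Compress a value with Deflate at the default level.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Decompress a value produced by deflate().
    static byte[] inflate(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] value = "<html><body>crawl crawl crawl crawl crawl</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte type = 4; // example type code, assumed to have the high bit clear

        byte[] compressed = deflate(value);
        byte flagged = markCompressed(type);

        // A reader that understands the flag checks it before touching the value.
        byte[] restored = isCompressed(flagged) ? inflate(compressed) : compressed;
        System.out.println("round trip ok: " + Arrays.equals(value, restored));
        System.out.println("type preserved: " + (originalType(flagged) == type));
        System.out.println(value.length + " bytes -> " + compressed.length + " bytes");
    }
}
```

Because an uncompressed cell keeps its type byte unchanged, entries written without value compression read exactly as before, which is the sense in which such a flag is backwards compatible.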



--
This message was sent by Atlassian Jira
(v8.3.4#803005)