Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2021/05/20 02:07:00 UTC

[jira] [Comment Edited] (HBASE-25869) WAL value compression

    [ https://issues.apache.org/jira/browse/HBASE-25869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342054#comment-17342054 ] 

Andrew Kyle Purtell edited comment on HBASE-25869 at 5/20/21, 2:06 AM:
-----------------------------------------------------------------------

*WAL Compression Results*

Site configuration:

{noformat}
<!-- retain all WALs  -->
<property>
   <name>hbase.master.logcleaner.ttl</name>
   <value>604800000</value>
</property>

<!-- to enable compression -->
<property>
    <name>hbase.regionserver.wal.enablecompression</name>
    <value>true</value>
</property>

<!-- to enable value compression -->
<property>
    <name>hbase.regionserver.wal.value.enablecompression</name>
    <value>true</value>
</property>
{noformat}

IntegrationTestLoadCommonCrawl
Input: s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz

Microbenchmarks are collected with [this change|https://gist.github.com/apurtell/596310d08b5ad75cd9677466d36360e4].
Statistics are collected over the lifetime of the regionserver and are dumped at shutdown, at the end of the test. Statistics are updated under synchronization, but this is done in a way that excludes the synchronization overhead from the measurement. The normal patch contains neither the instrumentation nor the synchronization point. Nanoseconds are converted to milliseconds for the table.
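The measurement pattern described above can be sketched as follows. This is a hypothetical illustration, not code from the actual instrumentation change; class and method names are invented.

```java
// Hypothetical sketch: the append is timed with System.nanoTime(), and the
// shared statistics are updated under synchronization only AFTER the clock
// has stopped, so lock overhead is excluded from the measured interval.
public class AppendTimer {
    private long count;
    private long totalNanos;

    public void timedAppend(Runnable append) {
        long start = System.nanoTime();
        append.run();                         // the work being measured
        long elapsed = System.nanoTime() - start;
        synchronized (this) {                 // outside the timed region
            count++;
            totalNanos += elapsed;
        }
    }

    // Average append time, converted from nanoseconds to milliseconds.
    public synchronized double avgMillis() {
        return count == 0 ? 0.0 : (totalNanos / (double) count) / 1_000_000.0;
    }
}
```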

SNAPPY or ZSTD are recommended; all other options are provided for comparison. (LZMA is included as a sanity check that an expensive algorithm really is expensive.)

When using SNAPPY or ZSTD we see a performance benefit due to reduced IO for the large values in the test case.

|| Mode || WALs aggregate size || WALs aggregate size difference  || WAL writer append time (ms avg) ||
| Default | 5,117,369,553 | - | 0.290 (stddev 0.328) |
| Compression enabled, value compression not enabled | 5,002,683,600 | (2.241%) | 0.372 (stddev 0.336) |
| Compression enabled, value compression enabled, algorithm=SNAPPY | 1,616,387,702 | (68.4%) | 0.027 (stddev 0.204) |
| Compression enabled, value compression enabled, algorithm=ZSTD (best speed) | 1,149,008,133 | (77.55%) | 0.043 (stddev 0.195) |
| Compression enabled, value compression enabled, algorithm=ZSTD (default) | 1,089,241,811 | (78.7%) | 0.056 (stddev 0.310) |
| Compression enabled, value compression enabled, algorithm=ZSTD (best compression) | 941,452,655 | (81.2%) | 0.231 (stddev 1.11) |
| _Options below not recommended._ | - | - | - |
| Compression enabled, value compression enabled, algorithm=GZ | 1,082,414,015 | (78.9%) | 0.267 (stddev 1.325) |
| Compression enabled, value compression enabled, algorithm=LZMA (level 1) | 1,013,951,637 | (80.2%) | 2.157 (stddev 3.302) |
| Compression enabled, value compression enabled, algorithm=LZMA (default) | 940,884,618 | (81.7%) | 4.739 (stddev 8.609) |
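For reference, the difference column above is percent reduction relative to the uncompressed default; e.g. for the ZSTD (default level) row:

```java
// How the "WALs aggregate size difference" column is derived: percent
// reduction of aggregate WAL size relative to the uncompressed baseline.
public class Reduction {
    public static double percentReduction(long baseline, long compressed) {
        return 100.0 * (1.0 - (double) compressed / baseline);
    }

    public static void main(String[] args) {
        // ZSTD (default) row: 1,089,241,811 bytes vs 5,117,369,553 baseline
        System.out.printf("%.1f%%%n",
            percentReduction(5_117_369_553L, 1_089_241_811L)); // prints 78.7%
    }
}
```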

In this test case WAL compression without value compression is not enough. The schema is already optimized for space efficiency: column families and qualifiers are single characters. There is a bit of redundancy that can be reclaimed but the bulky values (web crawl results) dominate. 
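To illustrate why the bulky values compress so well, here is a standalone sketch of per-value compression using the JDK's Deflater (the Deflate/GZ family). This is only an analogy: the real WAL codec streams value bytes through the configured compression algorithm, not through this helper.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of compressing a single cell value's bytes, as WAL value
// compression does conceptually for each edit. Compressible values
// (e.g. web crawl text) shrink substantially; the metadata dictionary
// scheme alone cannot reclaim this space.
public class ValueCompressionSketch {

    public static byte[] compress(byte[] value) {
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(value);
        d.finish();
        byte[] buf = new byte[Math.max(64, value.length)];
        int n = 0;
        while (!d.finished()) {
            if (n == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
            n += d.deflate(buf, n, buf.length - n);
        }
        d.end();
        return Arrays.copyOf(buf, n);
    }

    public static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[originalLength];
        int n = 0;
        try {
            while (n < originalLength && !inf.finished()) {
                n += inf.inflate(out, n, originalLength - n);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inf.end();
        }
        return out;
    }
}
```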


was (Author: apurtell):
*WAL Compression Results*

Site configuration:

{noformat}
<!-- retain all WALs  -->
<property>
   <name>hbase.master.logcleaner.ttl</name>
   <value>604800000</value>
</property>

<!-- to enable compression -->
<property>
    <name>hbase.regionserver.wal.enablecompression</name>
    <value>true</value>
</property>

<!-- to enable value compression -->
<property>
    <name>hbase.regionserver.wal.value.enablecompression</name>
    <value>true</value>
</property>
{noformat}

IntegrationTestLoadCommonCrawl
Input: s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz

|| Mode || WALs aggregate size || Difference ||
| Default | 5,006,963,824 | - |
| Compression enabled, value compression not enabled | 5,006,874,201 | (0.1%) |
| Compression enabled, value compression enabled | 940,657,251 | (81.2%) |

In this test case WAL compression without value compression is not enough. The schema is already optimized for space efficiency: column families and qualifiers are single characters. There is a bit of redundancy that can be reclaimed but the bulky values (web crawl results) dominate. 

> WAL value compression
> ---------------------
>
>                 Key: HBASE-25869
>                 URL: https://issues.apache.org/jira/browse/HBASE-25869
>             Project: HBase
>          Issue Type: Bug
>          Components: Operability, wal
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> WAL storage can be expensive, especially if the cell values represented in the edits are large, consisting of blobs or significant lengths of text. Such WALs might need to be kept around for a fairly long time to satisfy replication constraints on a space limited (or space-contended) filesystem.
> We have a custom dictionary compression scheme for cell metadata that is engaged when WAL compression is enabled in site configuration. This is fine for that application, where we can expect the universe of values and their lengths in the custom dictionaries to be constrained. For arbitrary cell values it is better to use one of the available compression codecs, which are suitable for arbitrary albeit compressible data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)