Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/04/30 01:48:00 UTC

[jira] [Comment Edited] (HBASE-26330) Document new provided compression codecs

    [ https://issues.apache.org/jira/browse/HBASE-26330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530297#comment-17530297 ] 

Andrew Kyle Purtell edited comment on HBASE-26330 at 4/30/22 1:47 AM:
----------------------------------------------------------------------

I performed a definitive microbenchmark for this issue. See https://github.com/apurtell/jmh-compression-tests .

In [blockdata.zip|https://github.com/apurtell/jmh-compression-tests/blob/master/src/main/resources/blockdata.zip] is 256 MB (258,126,022 bytes exactly) of block data, 2,680 blocks in total, extracted from two HFiles after processing some Common Crawl data with IntegrationLoadTestCommonCrawl. Each new codec implementation processed this data as if the blocks were being compressed again for write into an HFile, but without writing any data, so only the CPU time and resource demand of the codec itself is compared. The environment was Java 11 on a Linux aarch64 VM on Apple M1 Max silicon, but the relative differences are what matter. The measured time is the average time in milliseconds required to compress all blocks of the ~256 MB file. This is how long it would take to write an HFile containing these contents, minus the I/O overhead of block encoding and actual persistence.
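For context, this is roughly the operation being timed. A minimal sketch, assuming the codecs are driven through the standard Hadoop {{CompressionCodec}} interface; the actual JMH harness in the repository linked above may differ in detail:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class BlockCompressionSketch {
  // Compress every extracted HFile block through the given codec class and
  // return the total compressed size; timing this loop over all 2,680 blocks
  // is what the table below reports.
  static long compressAll(String codecClass, List<byte[]> blocks) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
        conf.getClassByName(codecClass), conf);
    long totalCompressedBytes = 0;
    for (byte[] block : blocks) {
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      try (OutputStream out = codec.createOutputStream(sink)) {
        out.write(block); // compress one block's worth of data, no HFile I/O
      }
      totalCompressedBytes += sink.size();
    }
    return totalCompressedBytes;
  }
}
{code}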

These are the results:

||Codec||Level||Time (milliseconds)||Compressed size (bytes)||Space savings||
|AirCompressor LZ4|-|349.989 ± 2.835|76,999,408|70.17%|
|AirCompressor LZO|-|334.554 ± 3.243|79,369,805|69.25%|
|AirCompressor Snappy|-|364.153 ± 19.718|80,201,763|68.93%|
|AirCompressor Zstandard|3 (effective)|1108.267 ± 8.969|55,129,189|78.64%|
|Brotli|1|593.107 ± 2.376|58,672,319|77.27%|
|Brotli|3|1345.195 ± 27.327|53,917,438|79.11%|
|Brotli|6|2812.411 ± 25.372|48,696,441|81.13%|
|Brotli|10|74615.936 ± 224.854|44,970,710|82.58%|
|LZ4|-|303.045 ± 0.783|76,974,364|70.18%|
|LZMA|1|6410.428 ± 115.065|49,948,535|80.65%|
|LZMA|3|8144.620 ± 152.119|49,109,363|80.97%|
|LZMA|6|43802.576 ± 382.025|46,951,810|81.81%|
|LZMA|9|49821.979 ± 580.110|46,951,810|81.81%|
|Xerial Snappy|-|360.225 ± 2.324|80,749,937|68.72%|
|Zstandard|1|654.699 ± 16.839|56,719,994|78.03%|
|Zstandard|3|839.160 ± 24.906|54,573,095|78.86%|
|Zstandard|5|1594.373 ± 22.384|52,025,485|79.84%|
|Zstandard|7|2308.705 ± 24.744|50,651,554|80.38%|
|Zstandard|9|3659.677 ± 58.018|50,208,425|80.55%|
|Zstandard|12|8705.294 ± 58.080|49,841,446|80.69%|
|Zstandard|15|19785.646 ± 278.080|48,499,508|81.21%|
|Zstandard|18|47702.097 ± 442.670|48,319,879|81.28%|
|Zstandard|22|97799.695 ± 1106.571|48,212,220|81.32%|

Schema design and configuration guidelines at my employer are informed by these results and similar results from earlier measurements: 
- Compression enabled by default on all tables.
- LZ4 implementation (org.apache.hadoop.hbase.io.compress.lz4.Lz4Codec) as default.
- If pure Java compression is required, use org.apache.hadoop.hbase.io.compress.aircompressor.Lz4Codec instead (see the configuration example after this list).
- For use cases where higher compression efficiencies are required, opt for Zstandard (org.apache.hadoop.hbase.io.compress.zstd.ZstdCodec).
-- Adjust {{ZstdCodec.ZSTD_LEVEL_KEY}} in the table or column family schema to fine-tune for your data (see the schema example after this list).
-- Level 1 is equivalent to LZ4, but LZ4 will perform better. 
-- Level 3 is a good fast default.
-- Diminishing returns after level 7.
-- Levels 12-22 are not recommended. 
- For web data, Brotli (org.apache.hadoop.hbase.io.compress.brotli.BrotliCodec) is the superior option. 
- When using a codec that is expensive in time, consider setting {{COMPRESSION}} in the table or column family schema to LZ4 and {{COMPRESSION_COMPACT_MAJOR}} to the more aggressive and expensive option (see the schema example after this list). That way only major compactions incur the significant overhead; flushes and short compactions will remain very fast. Allocate more long compaction threads than the default.
- Enable WAL value compression (HBASE-25869) by default.
-- Set the SNAPPY codec implementation to the AirCompressor Snappy codec (org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec) and configure WAL value compression to use the SNAPPY codec (see the configuration example after this list). It has reasonable performance in both space and time, and as a pure Java implementation it will be universally available.
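
To make the site-level settings concrete, here is a configuration sketch for hbase-site.xml covering the codec implementation overrides and WAL value compression. The property names are the ones I believe shipped with this work and HBASE-25869; confirm them against the {{Compression}} class and the WAL configuration in the release you run before copying:

{code:xml}
<!-- Pure Java LZ4 (aircompressor) instead of the default lz4-java backed codec, if needed -->
<property>
  <name>hbase.io.compress.lz4.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.aircompressor.Lz4Codec</value>
</property>
<!-- Back SNAPPY with the pure Java aircompressor implementation -->
<property>
  <name>hbase.io.compress.snappy.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec</value>
</property>
<!-- WAL compression must be enabled for value compression to take effect -->
<property>
  <name>hbase.regionserver.wal.enablecompression</name>
  <value>true</value>
</property>
<!-- Enable WAL value compression (HBASE-25869) and compress values with SNAPPY -->
<property>
  <name>hbase.regionserver.wal.value.compression.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.regionserver.wal.value.compression.type</name>
  <value>snappy</value>
</property>
{code}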
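And a schema-level sketch in the HBase shell showing the fast-codec-for-writes, expensive-codec-for-major-compactions split and per-family Zstandard level tuning. The table and family names are made up, and {{hbase.io.compress.zstd.level}} is my reading of {{ZstdCodec.ZSTD_LEVEL_KEY}}; check the constant, and how your shell version passes through the compaction-specific attribute, before relying on it:

{code}
# LZ4 for flushes and short compactions, ZSTD (level 3) only for major compaction output
create 'webtable', { NAME => 'c',
  COMPRESSION => 'LZ4',
  COMPRESSION_COMPACT_MAJOR => 'ZSTD',
  CONFIGURATION => { 'hbase.io.compress.zstd.level' => '3' } }
{code}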


> Document new provided compression codecs
> ----------------------------------------
>
>                 Key: HBASE-26330
>                 URL: https://issues.apache.org/jira/browse/HBASE-26330
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Blocker
>             Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> Document the new compression codecs:
> - The configuration keys used for setting the codec implementations for the various algorithms
> - The provided compression codecs
> - Default schema recommendations (LZ4)
> - Default WAL value compression recommendations (Snappy (aircompressor))



--
This message was sent by Atlassian Jira
(v8.20.7#820007)