Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/10/28 09:11:00 UTC

[jira] [Commented] (HADOOP-13126) Add Brotli compression codec

    [ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625537#comment-17625537 ] 

ASF GitHub Bot commented on HADOOP-13126:
-----------------------------------------

ibobak commented on PR #2723:
URL: https://github.com/apache/hadoop/pull/2723#issuecomment-1294738611

   Colleagues, 
   
   I took the source code from this commit: https://github.com/apache/hadoop/pull/2723/commits/47f05930c2f5c576a6c25238c187bdf3409b8f23
   
   I built a jar from it, plugged it into my Spark cluster, and launched a large job with many transformations and actions. I found a serious memory leak: executor RAM usage keeps growing, and although the executors were limited to 20GB, they consumed 40GB.
   
   I've written my own version of the Brotli codec (also based on brotli4j), modeled on how the Snappy codec and others are implemented, and it runs without memory leaks. I'll post my PR soon.
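
The comment does not say where the leak in the tested commit came from, so the following is only an illustration of one common pattern for JNI-backed codecs, not a diagnosis of PR #2723. Native encoders allocate memory outside the JVM heap, so creating a fresh one per block without releasing it can push a process past its configured limit. Hadoop's existing codecs avoid this by reusing compressor objects (for example through `CodecPool`) and releasing them with `end()`. The sketch below uses only the JDK: `java.util.zip.Deflater` stands in for a Brotli encoder, and the class `DeflaterPoolSketch` and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical pool; Deflater is a stand-in for a native-backed Brotli encoder. */
final class DeflaterPoolSketch {
    private final ArrayDeque<Deflater> idle = new ArrayDeque<>();
    private int created = 0;

    /** Reuse an idle compressor if one exists, otherwise allocate a new one. */
    synchronized Deflater borrow() {
        Deflater d = idle.poll();
        if (d == null) {
            d = new Deflater(Deflater.BEST_SPEED);
            created++;
        }
        return d;
    }

    /** Reset state so the instance can be reused, then return it to the pool. */
    synchronized void giveBack(Deflater d) {
        d.reset();
        idle.push(d);
    }

    /** Release the native memory held by every idle compressor. */
    synchronized void shutdown() {
        for (Deflater d : idle) {
            d.end();
        }
        idle.clear();
    }

    synchronized int createdCount() {
        return created;
    }

    byte[] compress(byte[] input) {
        Deflater d = borrow();
        try {
            d.setInput(input);
            d.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!d.finished()) {
                int n = d.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            giveBack(d); // always return the compressor, even on failure
        }
    }

    static byte[] decompress(byte[] data, int originalLength) {
        Inflater inf = new Inflater();
        try {
            inf.setInput(data);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < out.length && !inf.finished()) {
                int n = inf.inflate(out, off, out.length - off);
                if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                    break; // no further progress is possible
                }
                off += n;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt input", e);
        } finally {
            inf.end(); // one-shot decompressor: free native memory right away
        }
    }

    public static void main(String[] args) {
        DeflaterPoolSketch pool = new DeflaterPoolSketch();
        byte[] block = "brotli brotli brotli brotli".getBytes(StandardCharsets.UTF_8);
        byte[] last = null;
        for (int i = 0; i < 1000; i++) {
            last = pool.compress(block);
        }
        System.out.println("compressors created: " + pool.createdCount());
        System.out.println("round trip ok: "
                + Arrays.equals(block, decompress(last, block.length)));
        pool.shutdown();
    }
}
```

In this sketch a thousand blocks are compressed with a single reused instance, and `shutdown()` is the one place where pooled native memory is released. A real Hadoop codec would implement the `Compressor` and `Decompressor` interfaces rather than this ad hoc pool.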




> Add Brotli compression codec
> ----------------------------
>
>                 Key: HADOOP-13126
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13126
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 2.7.2
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HADOOP-13126.1.patch, HADOOP-13126.2.patch, HADOOP-13126.3.patch, HADOOP-13126.4.patch, HADOOP-13126.5.patch
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> I've been testing [Brotli|https://github.com/google/brotli/], a new compression library based on LZ77 from Google. Google's [brotli benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf] look really good and we're also seeing a significant improvement in compression size, compression speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite                      
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite                         
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite                            
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as snappy and ends up smaller than gzip-9. Another test resulted in a slightly larger Brotli file than gzip produced, but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT license|https://github.com/google/brotli/blob/master/LICENSE], and the [JNI library jbrotli is ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].
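
As a quick check of the figures quoted above, the file sizes and wall-clock times from the test listing can be turned into ratios. This is a small sketch using only the numbers shown in the listing; the class and method names are hypothetical. It gives a Brotli output of roughly 0.75 times the gzip size and roughly 0.47 times the snappy size, with gzip taking a little under three times as long as Brotli in this run.

```java
/** Hypothetical helper that recomputes ratios from the quoted test listing. */
final class BrotliTestRatios {
    static double ratio(long a, long b) {
        return (double) a / (double) b;
    }

    /** Convert a "XmY.YYYs" style reading into seconds. */
    static double seconds(int minutes, double secs) {
        return minutes * 60 + secs;
    }

    public static void main(String[] args) {
        // File sizes in bytes, copied from the `ls -l` output.
        long brotli = 1_068_821_936L;
        long gzip = 1_421_601_880L;
        long snappy = 2_265_950_833L;

        // "real" times copied from the `time` output.
        double brotliTime = seconds(1, 16.640);
        double snappyTime = seconds(1, 17.106);
        double gzipTime = seconds(3, 39.496);

        System.out.printf("brotli/gzip size:   %.3f%n", ratio(brotli, gzip));
        System.out.printf("brotli/snappy size: %.3f%n", ratio(brotli, snappy));
        System.out.printf("gzip/brotli time:   %.3f%n", gzipTime / brotliTime);
        System.out.printf("snappy/brotli time: %.3f%n", snappyTime / brotliTime);
    }
}
```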



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
