Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/09/04 04:34:00 UTC

[jira] [Commented] (PARQUET-2184) Improve SnappyCompressor buffer expansion performance

    [ https://issues.apache.org/jira/browse/PARQUET-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600012#comment-17600012 ] 

ASF GitHub Bot commented on PARQUET-2184:
-----------------------------------------

abaranec opened a new pull request, #993:
URL: https://github.com/apache/parquet-mr/pull/993

   This PR improves the allocation behavior of SnappyCompressor.  Previously, when more buffer space was needed, it allocated only enough for the new data being written.  Now it doubles the internal buffer size up to 8MB, and after that increases the size in 1MB increments.
   
   No additional unit tests are added, as the existing unit tests for SnappyCodec and others already verify correctness.  I have personally verified the performance gains using JMH benchmarks.
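
   The growth policy described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name BufferGrowth, the method nextCapacity, and the constant names are hypothetical, and capping the doubled size at exactly 8MB before switching to 1MB steps is an assumption about how "up to 8MB" is meant.

```java
// Hypothetical sketch of the expansion policy; not the actual SnappyCompressor code.
final class BufferGrowth {
    // Thresholds taken from the PR description; the names are made up for this example.
    static final int DOUBLING_LIMIT = 8 * 1024 * 1024; // 8MB
    static final int LINEAR_STEP = 1024 * 1024;        // 1MB

    /**
     * Returns a capacity that is at least {@code required}.
     * Below 8MB the capacity doubles (capped at 8MB, an assumption);
     * from 8MB onward it grows in 1MB increments.
     */
    static int nextCapacity(int current, int required) {
        if (required <= current) {
            return current; // enough room already, no reallocation needed
        }
        long capacity = Math.max(current, 1); // long avoids int overflow while growing
        while (capacity < required) {
            if (capacity < DOUBLING_LIMIT) {
                capacity = Math.min(capacity * 2, DOUBLING_LIMIT);
            } else {
                capacity += LINEAR_STEP;
            }
        }
        // required fits in an int, so clamping still satisfies the request.
        return (int) Math.min(capacity, Integer.MAX_VALUE);
    }
}
```

   With these assumptions, a 1KB buffer asked to hold 5000 bytes grows to 8KB, and an 8MB buffer asked to hold one more byte grows to 9MB.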




> Improve SnappyCompressor buffer expansion performance
> -----------------------------------------------------
>
>                 Key: PARQUET-2184
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2184
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Andrew Baranec
>            Priority: Minor
>
> The existing implementation of SnappyCompressor allocates only enough bytes for the buffer passed into setInput().  This leads to suboptimal performance when write patterns cause repeated buffer expansions.  In the worst case, it must copy the entire buffer on every invocation of setInput().
> Instead of allocating a buffer of size current + write length, there should be an expansion strategy that reduces the amount of copying required.
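
To make the copying cost in the issue description concrete, the sketch below counts how many bytes get copied over a series of equal-sized writes under two policies: growing to exactly current + write length, and plain doubling. It is a simplified model, not a benchmark of SnappyCompressor: the class CopyCost and its methods are hypothetical, and plain doubling stands in for whatever expansion strategy is eventually chosen.

```java
// Hypothetical model of reallocation cost; it only counts bytes copied.
final class CopyCost {
    /** Bytes copied when every expansion allocates exactly used + chunk. */
    static long exactFit(int writes, int chunk) {
        long copied = 0, capacity = 0, used = 0;
        for (int i = 0; i < writes; i++) {
            if (used + chunk > capacity) {
                copied += used;          // existing contents move to the new buffer
                capacity = used + chunk; // just enough for this write
            }
            used += chunk;
        }
        return copied;
    }

    /** Bytes copied when every expansion doubles the capacity until the write fits. */
    static long doubling(int writes, int chunk) {
        long copied = 0, capacity = 0, used = 0;
        for (int i = 0; i < writes; i++) {
            long needed = used + chunk;
            if (needed > capacity) {
                copied += used;          // existing contents move to the new buffer
                long grown = Math.max(capacity, 1);
                while (grown < needed) {
                    grown *= 2;
                }
                capacity = grown;
            }
            used = needed;
        }
        return copied;
    }
}
```

In this model, 1000 one-byte writes copy 499500 bytes with exact-fit growth and 1023 bytes with doubling, which is the quadratic versus linear difference the issue is pointing at.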



--
This message was sent by Atlassian Jira
(v8.20.10#820010)