
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/27 23:22:43 UTC

[GitHub] [arrow-rs] marioloko opened a new pull request, #2959: Pass decompressed size to parquet Codec::decompress (#2956)

marioloko opened a new pull request, #2959:
URL: https://github.com/apache/arrow-rs/pull/2959

# Which issue does this PR close?

Closes #2956.

# Rationale for this change

# What changes are included in this PR?

Added an optional `uncompressed_size` argument to `Codec::decompress` so that each codec can better estimate the required output buffer size.

* snappy: Probably not much improvement, as `decompress_len` is already accurate.
* gzip: No improvement. Ignores the size hint.
* brotli: Probably not much improvement. The buffer size will be equal to `uncompressed_size`.
* lz4: No improvement. The buffer lives on the stack, so there are no extra allocations. It is probably better to keep it working as it is.
* zstd: No improvement. Ignores the size hint.
* lz4_raw: Improvement. The estimation method over-estimates, so knowing the uncompressed size reduces allocations.
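
To illustrate how a codec can use the hint, here is a minimal sketch. The trait below is a hypothetical stand-in, not the actual `Codec` definition in the parquet crate, and `RunLengthCodec` is a toy format invented for the example. With a hint the codec reserves exactly the needed capacity; without one it falls back to a worst-case guess.

```rust
/// Hypothetical stand-in for the parquet `Codec` trait (assumption, not the real definition).
trait Codec {
    /// Decompresses `input`, appends the result to `output`, and returns the
    /// number of bytes written. `uncompressed_size` is an optional size hint.
    fn decompress(
        &mut self,
        input: &[u8],
        output: &mut Vec<u8>,
        uncompressed_size: Option<usize>,
    ) -> Result<usize, String>;
}

/// Toy run-length format used only for illustration: a sequence of (count, byte) pairs.
struct RunLengthCodec;

impl Codec for RunLengthCodec {
    fn decompress(
        &mut self,
        input: &[u8],
        output: &mut Vec<u8>,
        uncompressed_size: Option<usize>,
    ) -> Result<usize, String> {
        if input.len() % 2 != 0 {
            return Err("truncated run-length input".to_string());
        }
        // With a hint, reserve exactly what is needed. Without one, fall back
        // to the worst case for this toy format: 255 output bytes per pair.
        let capacity = uncompressed_size.unwrap_or((input.len() / 2) * 255);
        output.reserve(capacity);

        let start = output.len();
        for pair in input.chunks_exact(2) {
            let (count, byte) = (pair[0] as usize, pair[1]);
            output.resize(output.len() + count, byte);
        }

        let written = output.len() - start;
        // If a hint was given, treat a mismatch as corrupt input and undo the write.
        if let Some(expected) = uncompressed_size {
            if expected != written {
                output.truncate(start);
                return Err(format!("expected {expected} bytes, got {written}"));
            }
        }
        Ok(written)
    }
}
```

Passing `Some(size)` here avoids the over-reservation that the `None` path makes, which mirrors the lz4_raw case described above.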

# Are there any user-facing changes?

Breaking change to the `Codec` trait. It only affects users with the `experimental` feature enabled.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] ursabot commented on pull request #2959: Pass decompressed size to parquet Codec::decompress (#2956)

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #2959:
URL: https://github.com/apache/arrow-rs/pull/2959#issuecomment-1295709983

   Benchmark runs are scheduled for baseline = 94a7f4b69901754126186f4e18d08d59af76088e and contender = 344c552d701374582ac1aff198e62acb9907afb6. 344c552d701374582ac1aff198e62acb9907afb6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/25ac84e4c74c4bde8cf74dd2abe3f747...f8a6e04b448343b39f7b6deaf5a2a4b3/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/60252ee960984d80ac453714601f30b8...9976375b922b43f983cf207bc0772235/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/917d9ea066d3420fbddcd22e705d63b3...7133a3b8aae44b069db6caa139956a32/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f5c0c6ebdb9746e79077b0927980f297...b78f531e29fd4d1da68934acaa050ae0/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow-rs] tustvold merged pull request #2959: Pass decompressed size to parquet Codec::decompress (#2956)

Posted by GitBox <gi...@apache.org>.

tustvold merged PR #2959:
URL: https://github.com/apache/arrow-rs/pull/2959



[GitHub] [arrow-rs] marioloko commented on pull request #2959: Pass decompressed size to parquet Codec::decompress (#2956)

Posted by GitBox <gi...@apache.org>.

marioloko commented on PR #2959:
URL: https://github.com/apache/arrow-rs/pull/2959#issuecomment-1295330273

   It seems that the estimate of the lz4 uncompressed size can overflow for small compressed sizes. Any compressed size smaller than 10 will overflow and therefore panic.
   
   So I see two options now:
   1. Change the prediction formula to return 255 for any compressed size smaller than 10.
   2. Only allow lz4_raw if `uncompressed_size` is provided, and otherwise return an error saying 'LZ4_RAW without known uncompressed_size is unsupported'.
   
   I would go with the second one: even though the overflow only happens for small compressed sizes, a compressed size of 1 GB would make the estimate reserve ~250 GB, which is too much. So I would avoid prediction.
   
   What do you think?
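
The overflow and the second option can be sketched as follows. The estimate formula is an assumption, reconstructed only so that it matches the numbers quoted in the comment (failure below 10 bytes, roughly 255x expansion); it is not copied from the crate. Both function names are hypothetical.

```rust
/// Hypothetical worst-case estimate: about 255x the compressed size minus a
/// fixed constant. The constants are assumptions chosen to reproduce the
/// behaviour described above. Checked arithmetic returns `None` where plain
/// subtraction on an unsigned integer would overflow and panic.
fn estimate_uncompressed_size(compressed_size: u64) -> Option<u64> {
    compressed_size.checked_mul(255)?.checked_sub(2526)
}

/// Option 2: never estimate, and require the caller to supply the size.
fn lz4_raw_output_size(uncompressed_size: Option<u64>) -> Result<u64, String> {
    uncompressed_size
        .ok_or_else(|| "LZ4_RAW without known uncompressed_size is unsupported".to_string())
}
```

Under these assumed constants, `estimate_uncompressed_size(9)` is `None`, `estimate_uncompressed_size(10)` is `Some(24)`, and a 1 GiB input yields an estimate of about 255 GiB, which is why requiring the real size is the safer choice.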

