Posted to issues@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2023/01/26 15:04:49 UTC

[GitHub] [arrow] pitrou opened a new issue, #33885: [C++] Improve compression strategy in IPC, Parquet

pitrou opened a new issue, #33885:
URL: https://github.com/apache/arrow/issues/33885

   ### Describe the enhancement requested
   
   Both Arrow IPC and the Parquet format allow optional compression of data buffers.
   Currently, the heuristic used in the Arrow C++ codebase is simple: try to compress the entire data buffer, and write the compressed data if savings are achieved, otherwise write the uncompressed data (to save on decompression costs).
   
   However, this heuristic always pays the full cost of compression even for uncompressible data (and compression is usually much more costly than decompression). This could be improved by employing a sampling strategy to reduce the cost of attempting to compress uncompressible data.
   
   We could for example find inspiration in [Dask distributed's compression strategy](https://github.com/dask/distributed/blob/0063de53fed5e4e2e409940213c6265867e6635d/distributed/protocol/compression.py#L153).
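A sampling heuristic of this kind could look roughly like the sketch below. It is written in Python for brevity and is not Arrow or Dask code: the names `byte_sample` and `maybe_compress`, the size thresholds, the 10% minimum saving, the evenly spaced sampling and the use of `zlib` are all assumptions made for illustration.

```python
import os
import zlib


def byte_sample(data: bytes, sample_size: int, nsamples: int) -> bytes:
    """Concatenate `nsamples` evenly spaced slices of `sample_size` bytes."""
    if len(data) <= sample_size * nsamples:
        return data
    span = len(data) - sample_size
    starts = [i * span // max(nsamples - 1, 1) for i in range(nsamples)]
    return b"".join(data[s:s + sample_size] for s in starts)


def maybe_compress(data: bytes, *, min_size: int = 10_000,
                   sample_size: int = 10_000, nsamples: int = 5,
                   min_saving: float = 0.1, compress=zlib.compress):
    """Return (is_compressed, payload); all thresholds are illustrative."""
    if len(data) < min_size:
        return False, data  # too small to be worth an attempt
    limit = 1.0 - min_saving
    sample = byte_sample(data, sample_size, nsamples)
    if len(compress(sample)) > limit * len(sample):
        # The sample barely shrinks: skip the expensive full pass.
        return False, data
    compressed = compress(data)
    if len(compressed) > limit * len(data):
        # The sample was misleading: still fall back to the raw bytes.
        return False, data
    return True, compressed


noise = os.urandom(1_000_000)            # effectively incompressible
text = b"arrow ipc parquet " * 50_000    # highly repetitive
print(maybe_compress(noise)[0])  # False
print(maybe_compress(text)[0])   # True
```

For the incompressible input, only the 50,000-byte sample is passed to the compressor instead of the full buffer, which is where the saving would come from. The trade-off is that compressible buffers pay for the sample in addition to the full pass.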
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #33885: [C++] Improve compression strategy in IPC, Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #33885:
URL: https://github.com/apache/arrow/issues/33885#issuecomment-1409827729

   I'm interested in this patch, but after going through the code, I found that:
   1. In Parquet, `PageWriter::Open` always uses the user-specified or default compression; it does not make this choice adaptively.
   2. In Parquet, a ColumnChunk always uses the same compression for all of its pages, so we need to decide whether to compress when writing the first page. If we're using DICTIONARY encoding, it gets much trickier, because the dictionary page is PLAIN-encoded and relies on compression.
   
   This seems tricky. Should we sample the first page and use it to decide whether to compress the whole ColumnChunk?
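One possible reading of this question is sketched below, in Python for brevity rather than as Arrow C++ code. The function `write_column_chunk`, the codec labels, the 10% saving threshold and the use of `zlib` are hypothetical. The sketch compresses the first page in full as its sample and does not address the dictionary-page concern raised in point 2.

```python
import os
import zlib
from typing import Iterable, Iterator, Tuple


def write_column_chunk(pages: Iterable[bytes], *, min_saving: float = 0.1,
                       compress=zlib.compress) -> Iterator[Tuple[str, bytes]]:
    """Yield (codec, payload) per page, with one codec for the whole chunk.

    The decision is made once, from the first page only, because every
    page of a column chunk must share the same compression setting.
    """
    it = iter(pages)
    first = next(it, None)
    if first is None:
        return  # empty chunk: nothing to write
    first_compressed = compress(first)
    use_compression = len(first_compressed) <= (1.0 - min_saving) * len(first)
    codec = "ZLIB" if use_compression else "UNCOMPRESSED"
    yield codec, (first_compressed if use_compression else first)
    for page in it:
        # Later pages follow the first-page decision, for better or worse.
        yield codec, (compress(page) if use_compression else page)


pages = [os.urandom(4096), b"a" * 4096]
for codec, payload in write_column_chunk(pages):
    print(codec, len(payload))
```

In this usage example the first page is random, so the second, highly compressible page is also written uncompressed. That illustrates the risk of deciding from the first page alone when the data changes character within a chunk.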




[GitHub] [arrow] pitrou commented on issue #33885: [C++] Improve compression strategy in IPC, Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #33885:
URL: https://github.com/apache/arrow/issues/33885#issuecomment-1405146226

   (originally discussed as part of https://github.com/apache/arrow/pull/15194)

