Posted to issues@arrow.apache.org by "andrewthad (via GitHub)" <gi...@apache.org> on 2023/03/14 16:23:02 UTC

[GitHub] [arrow] andrewthad opened a new issue, #34560: FAQ claims that Arrow format is not compressed

andrewthad opened a new issue, #34560:
URL: https://github.com/apache/arrow/issues/34560

### Describe the bug, including details regarding any error messages, version, and platform.

Contrasting Arrow with Parquet, the Arrow FAQ claims:

> Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

This is not accurate. According to `Message.fbs`, buffers may be compressed with LZ4 or ZSTD. PyArrow supports even more codecs: brotli, gzip, lz4, lz4_raw, snappy, zstd. I have not followed the history of Apache Arrow, but I assume that compression support was added later and that the FAQ is out of date.

If the FAQ is brought up to date, it would also be good to explain why support for compression was added. My best guess is that many analytics workloads need only a single beginning-to-end scan of certain columns, and for these workloads there is no advantage to memory mapping. (Memory mapping would be most helpful if you repeatedly accessed the same data.) In these cases, compression would reduce the time spent on disk (or network) I/O, and since the buffers are compressed individually, you only decompress what you need (unlike Parquet). But all of that is just my guess. It would be good to have a more authoritative source confirm or correct this.
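
To make the "compressed individually" point concrete, here is a minimal sketch of the idea. It is not Arrow's actual IPC implementation: it uses only the Python standard library, `zlib` stands in for LZ4/ZSTD, and the column names, values, and helper functions are hypothetical.

```python
import struct
import zlib


def make_column(values):
    """Pack integers into one contiguous little-endian int64 buffer."""
    return struct.pack(f"<{len(values)}q", *values)


# Hypothetical three-column table; names and values are illustrative only.
columns = {
    "id": make_column(range(10_000)),
    "price": make_column([i * 7 % 1000 for i in range(10_000)]),
    "qty": make_column([i % 5 for i in range(10_000)]),
}

# Compress every column buffer on its own (zlib is an assumption here,
# chosen only because it ships with Python).
compressed = {name: zlib.compress(buf) for name, buf in columns.items()}


def read_column(name):
    """Decompress and decode one column; the other buffers stay compressed."""
    raw = zlib.decompress(compressed[name])
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


# A scan that needs only "qty" pays the decompression cost for that buffer alone.
qty = read_column("qty")

total_raw = sum(len(buf) for buf in columns.values())
total_compressed = sum(len(buf) for buf in compressed.values())
```

Comparing `total_raw` with `total_compressed` shows how many fewer bytes would have to come off disk or the network, while `read_column` touches a single buffer.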

### Component(s)

Documentation

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on issue #34560: FAQ claims that Arrow format is not compressed

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on issue #34560:
URL: https://github.com/apache/arrow/issues/34560#issuecomment-1468442318

   The confusion here is between the in-memory and IPC (serialization) formats; the FAQ is trying to contrast the Arrow in-memory format with the Parquet file/on-disk format (i.e., Parquet and Arrow are not directly substitutable). But Arrow also has an IPC format, which makes the comparison fuzzier, and I think it would be good to clarify things further as you suggest.

