Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/17 17:42:00 UTC

[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

    [ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617883#comment-16617883 ] 

Wes McKinney commented on ARROW-300:
------------------------------------

Moving this to 0.12. I will make a proposal for compressed record batches after the 0.11 release goes out.

My gut instinct on this would be to create a {{CompressedBuffer}} metadata type and a {{CompressedRecordBatch}} message. Some reasons:

* Does not complicate or bloat the existing RecordBatch message type
* Supports buffer-level compression (each buffer can be compressed or not)

Readers can choose to materialize right away or on demand -- in C++, if we want, we can create an {{arrow::CompressedRecordBatch}} class that does late materialization.
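
To make the "right away or on demand" distinction concrete, here is a minimal Python sketch rather than C++. The class name {{LazyCompressedBatch}}, its methods, and the use of zlib are all assumptions made for illustration; none of this is an existing Arrow API.

```python
import zlib
from typing import Dict


class LazyCompressedBatch:
    """Hypothetical stand-in for a late-materializing record batch."""

    def __init__(self, compressed_columns: Dict[str, bytes]):
        self._compressed = compressed_columns
        self._cache: Dict[str, bytes] = {}
        self.decompress_calls = 0  # instrumentation for this example only

    def column(self, name: str) -> bytes:
        # On-demand path: decompress on first access, then reuse the cache.
        if name not in self._cache:
            self._cache[name] = zlib.decompress(self._compressed[name])
            self.decompress_calls += 1
        return self._cache[name]

    def materialize(self) -> Dict[str, bytes]:
        # Eager path: decompress every column right away.
        return {name: self.column(name) for name in self._compressed}


batch = LazyCompressedBatch({
    "a": zlib.compress(b"\x01\x00\x00\x00" * 1000),
    "b": zlib.compress(b"\x02\x00\x00\x00" * 1000),
})
first = batch.column("a")  # only column "a" is decompressed here
```

A reader that only touches one column pays for one decompression; calling {{materialize()}} gives the eager behavior.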

This does not necessarily accommodate other kinds of type-specific compression, such as run-length encoding (RLE), though it might be that RLE can be used on the values buffer of primitive types, e.g. with metadata like:

{code}
CompressedBuffer {
  CompressionType type;
  int64 offset;
  int64 compressed_size;
  int64 uncompressed_size;
}
{code}

So if we wanted to use the Parquet RLE_BITPACKED_HYBRID compression style for integers, say, we could do that.
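
As a rough illustration of how such per-buffer metadata could be used, the following Python sketch writes a list of buffers into one message body and records a {{CompressedBuffer}} entry for each. The field names follow the struct above; the {{NONE}}/{{ZLIB}} enum values, the helpers {{write_buffers}} and {{read_buffer}}, and the "compress only if it shrinks" rule are assumptions for the example. Buffer alignment and padding are ignored.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class CompressionType(Enum):
    NONE = 0  # hypothetical enum values, for illustration only
    ZLIB = 1


@dataclass
class CompressedBuffer:
    type: CompressionType
    offset: int             # start of the payload within the body
    compressed_size: int    # bytes occupied in the body
    uncompressed_size: int  # bytes after decompression


def write_buffers(buffers: List[bytes]) -> Tuple[List[CompressedBuffer], bytes]:
    """Pack buffers into one body, choosing compression per buffer."""
    body = bytearray()
    metadata = []
    for raw in buffers:
        packed = zlib.compress(raw)
        # Keep the compressed form only when it is actually smaller.
        if len(packed) < len(raw):
            kind, payload = CompressionType.ZLIB, packed
        else:
            kind, payload = CompressionType.NONE, raw
        metadata.append(CompressedBuffer(kind, len(body), len(payload), len(raw)))
        body += payload
    return metadata, bytes(body)


def read_buffer(meta: CompressedBuffer, body: bytes) -> bytes:
    """Slice one buffer out of the body and decompress it if needed."""
    payload = body[meta.offset:meta.offset + meta.compressed_size]
    if meta.type is CompressionType.ZLIB:
        payload = zlib.decompress(payload)
    if len(payload) != meta.uncompressed_size:
        raise ValueError("buffer size does not match metadata")
    return payload


buffers = [b"\x00" * 4096, bytes(range(16))]
metadata, body = write_buffers(buffers)
restored = [read_buffer(m, body) for m in metadata]
```

Here the highly repetitive first buffer is stored compressed, while the short second buffer is stored as-is because zlib would only make it larger. A type-specific scheme such as RLE would simply be another {{CompressionType}} value in this layout.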

Another question here is how to handle compression schemes that take additional parameters. {{CompressionType}} or {{Compression}} could be a union, but that would make the messages larger (though maybe that's OK).
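
One way to picture the union idea is sketched below in Python, where each member of the union carries its own parameters. The member types {{NoCompression}} and {{ZlibCompression}}, the {{level}} parameter, and the {{compress}} helper are hypothetical; in the actual format this would be expressed in the Flatbuffers metadata, not in Python.

```python
import zlib
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class NoCompression:
    """Union member with no parameters (hypothetical)."""


@dataclass(frozen=True)
class ZlibCompression:
    """Union member carrying a codec-specific parameter (hypothetical)."""
    level: int = 6


# The union: each alternative brings its own parameter set.
Compression = Union[NoCompression, ZlibCompression]


def compress(data: bytes, codec: Compression) -> bytes:
    # Dispatch on the union member and use its parameters.
    if isinstance(codec, ZlibCompression):
        return zlib.compress(data, codec.level)
    if isinstance(codec, NoCompression):
        return data
    raise TypeError(f"unsupported compression: {codec!r}")


data = b"arrow" * 1000
packed = compress(data, ZlibCompression(level=9))
plain = compress(data, NoCompression())
```

The cost mentioned above shows up here as the extra per-member fields that would have to be serialized alongside each buffer or batch.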

> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
>                 Key: ARROW-300
>                 URL: https://issues.apache.org/jira/browse/ARROW-300
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 0.12.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer compression setting in the file Footer. Probably the only two compressors worth supporting out of the box would be zlib (higher compression ratios) and lz4 (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)