Posted to issues@arrow.apache.org by "Kazuaki Ishizaki (JIRA)" <ji...@apache.org> on 2017/04/11 02:30:42 UTC

[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

    [ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963751#comment-15963751 ] 

Kazuaki Ishizaki commented on ARROW-300:
----------------------------------------

Apache Spark currently supports [the following compression schemes|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala#L66] for in-memory columnar storage. Compressed in-memory columnar storage is used when the DataFrame.cache or Dataset.cache method is executed.
Would it be possible to support these schemes in addition to LZ4 and the current DictionaryEncoding?

* RunLengthEncoding: Generic run-length encoding (e.g. 1,1,1,2,2,2,2 -> [3, 1], [4, 2])
* IntDelta: Represents a sequence of ints as a base value followed by byte-sized deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
* LongDelta: Represents a sequence of longs as a base value followed by byte-sized deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
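
To make the two ideas above concrete, here is a minimal Python sketch of run-length and delta encoding that reproduces the examples in the list. It is only an illustration, not Spark's or Arrow's implementation: the function names are hypothetical, runs are stored as (count, value) pairs, and the sketch does not model the one-byte limit on deltas, so what happens when a delta does not fit in a byte is left out.

```python
from itertools import accumulate, groupby


def rle_encode(values):
    """Run-length encode a sequence as (count, value) pairs.

    Hypothetical helper for illustration: 1,1,1,2,2,2,2 -> (3, 1), (4, 2).
    """
    return [(len(list(run)), value) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out


def delta_encode(values):
    """Encode as a base value followed by deltas from the previous element.

    Assumption: deltas are kept as plain Python ints here; a real encoder
    would need a rule for deltas that do not fit in one byte.
    """
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """A running sum over base + deltas restores the original sequence."""
    return list(accumulate(encoded))
```

Round-tripping the examples from the list, `rle_encode([1, 1, 1, 2, 2, 2, 2])` gives `[(3, 1), (4, 2)]` and `delta_encode([1, 3, 5, 7, 10])` gives `[1, 2, 2, 2, 3]`; the matching decode functions return the original sequences.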


> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
>                 Key: ARROW-300
>                 URL: https://issues.apache.org/jira/browse/ARROW-300
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>
> If data is to be sent over the wire, it may be useful to compress the data buffers themselves as they are being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer compression setting in the file Footer. Probably the only two compressors worth supporting out of the box would be zlib (higher compression ratios) and lz4 (better performance).
> What does everyone think?


