Posted to user@orc.apache.org by David Phillips <da...@acz.org> on 2018/02/08 05:41:26 UTC

Compressed data format for LZ4

We previously reserved an LZ4 CompressionKind and plan to implement it in
the Presto reader and writer. Unlike Snappy, the LZ4 format does not
record the uncompressed length. Thus, when reading, we need to allocate an
output buffer that is the full compressionBlockSize. This can waste a lot
of memory when there are many streams and many open readers.

We propose to prefix the LZ4 block with the uncompressed size. I see a few
ways of doing it:

1) Variable length integer, the same as Snappy.
2) Fixed 3-byte integer, little-endian.
3) Fixed 4-byte integer, little-endian.

Option #1 is more complicated, uses more CPU to decode, and probably
doesn't save much space; buffers starting at 16kB will use 3 bytes.
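
To put numbers on the size claim for option #1, here is a minimal sketch, assuming the usual base-128 varint (7 payload bits per byte plus a continuation bit) that Snappy uses for its length preamble. The class and method names are made up for illustration and do not come from ORC or Presto.

```java
final class VarintSize {
    // Bytes needed by a base-128 varint for a non-negative value:
    // each byte carries 7 payload bits plus one continuation bit.
    static int varintLength(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        int bytes = 1;
        while ((value >>>= 7) != 0) {
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(varintLength(16 * 1024 - 1)); // 2
        System.out.println(varintLength(16 * 1024));     // 3
        System.out.println(varintLength(256 * 1024));    // 3
    }
}
```

Under that assumption, every buffer from 16kB up to the 256kB writer cap costs 3 bytes, the same as a fixed 3-byte field.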

Option #2 restricts the maximum size to 16MB minus 1 byte. That is
ridiculously large for a per-stream buffer, and current writers cap the
buffer size at a reasonable 256kB, so it shouldn't be a problem in
practice, but it's worth calling out here.
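
A minimal sketch of what option #2 could look like, assuming the uncompressed length is written as the first 3 bytes in front of the LZ4 block. `Lz4LengthPrefix`, `writeLength`, and `readLength` are hypothetical names for illustration only; this is the proposal as described, not an existing ORC or Presto API.

```java
final class Lz4LengthPrefix {
    // Largest value that fits in 3 bytes: 2^24 - 1, i.e. 16MB minus 1 byte.
    static final int MAX_LENGTH = (1 << 24) - 1;

    // Write the uncompressed length as a fixed 3-byte little-endian integer.
    static void writeLength(byte[] out, int offset, int length) {
        if (length < 0 || length > MAX_LENGTH) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        out[offset] = (byte) length;
        out[offset + 1] = (byte) (length >>> 8);
        out[offset + 2] = (byte) (length >>> 16);
    }

    // Read the 3-byte little-endian length back.
    static int readLength(byte[] in, int offset) {
        return (in[offset] & 0xFF)
                | (in[offset + 1] & 0xFF) << 8
                | (in[offset + 2] & 0xFF) << 16;
    }

    public static void main(String[] args) {
        byte[] header = new byte[3];
        writeLength(header, 0, 256 * 1024);
        // A reader could size its output buffer from this value
        // instead of allocating the full compressionBlockSize.
        System.out.println(readLength(header, 0)); // 262144
    }
}
```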

Option #3 is flexible but in practice will waste a byte.

My vote is for option #2.

Re: Compressed data format for LZ4

Posted by David Phillips <da...@acz.org>.

Thanks for the pointer. I didn't realize that was already implemented.

I think the 3-byte header you are talking about is for the compressed size
(and the "isOriginal" flag). I was asking about adding the uncompressed
size as part of the compressor-specific compressed data block (like Snappy
has). However, since Apache ORC already implements it, the compressed data
format has already been decided.
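
For reference, the existing 3-byte header mentioned above can be sketched as follows, assuming the layout given in the ORC specification: (compressedLength * 2 + isOriginal) stored as a little-endian value. The class and method names are hypothetical and are not the ORC reader's actual API.

```java
final class OrcChunkHeader {
    // Encode the 3-byte chunk header: the low bit is the isOriginal flag,
    // the remaining 23 bits hold the chunk length.
    static byte[] encode(int chunkLength, boolean isOriginal) {
        if (chunkLength < 0 || chunkLength >= (1 << 23)) {
            throw new IllegalArgumentException("length out of range: " + chunkLength);
        }
        int value = (chunkLength << 1) | (isOriginal ? 1 : 0);
        return new byte[] {(byte) value, (byte) (value >>> 8), (byte) (value >>> 16)};
    }

    // Length of the chunk that follows the header.
    static int chunkLength(byte[] header) {
        int value = (header[0] & 0xFF)
                | (header[1] & 0xFF) << 8
                | (header[2] & 0xFF) << 16;
        return value >>> 1;
    }

    // True when the chunk was stored without compression.
    static boolean isOriginal(byte[] header) {
        return (header[0] & 1) != 0;
    }

    public static void main(String[] args) {
        byte[] h = encode(100_000, false);
        // Prints "40 0d 03"
        System.out.printf("%02x %02x %02x%n", h[0] & 0xFF, h[1] & 0xFF, h[2] & 0xFF);
    }
}
```

Note that this header carries only the stored chunk length, so it does not tell a reader how large the decompressed output will be.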

On Wed, Feb 7, 2018 at 9:58 PM, Owen O'Malley <ow...@gmail.com>
wrote:

>    In general this is probably better on dev@orc, but this works. ORC-77
> (62fe9504b) implemented the LZ4 codec using airlift. The structure is the
> same as the other codecs and it always uses a 3 byte header (#2).
>
>

Re: Compressed data format for LZ4

Posted by Owen O'Malley <ow...@gmail.com>.

Hi David,
   In general this is probably better on dev@orc, but this works. ORC-77
(62fe9504b) implemented the LZ4 codec using airlift. The structure is the
same as the other codecs and it always uses a 3 byte header (#2).

.. Owen

On Wed, Feb 7, 2018 at 9:41 PM, David Phillips <da...@acz.org> wrote:

> We previously reserved an LZ4 CompressionKind and plan to implement it in
> the Presto reader and writer. Unlike Snappy, the LZ4 format does not
> record the uncompressed length. Thus, when reading, we need to allocate an
> output buffer that is the full compressionBlockSize. This can waste a lot
> of memory when there are many streams and many open readers.
>
> We propose to prefix the LZ4 block with the uncompressed size. I see a few
> ways of doing it:
>
> 1) Variable length integer, the same as Snappy.
> 2) Fixed 3-byte integer, little-endian.
> 3) Fixed 4-byte integer, little-endian.
>
> Option #1 is more complicated, uses more CPU to decode, and probably
> doesn't save much space; buffers starting at 16kB will use 3 bytes.
>
> Option #2 restricts the maximum size to 16MB minus 1 byte. That is
> ridiculously large for a per-stream buffer, and current writers cap the
> buffer size at a reasonable 256kB, so it shouldn't be a problem in
> practice, but it's worth calling out here.
>
> Option #3 is flexible but in practice will waste a byte.
>
> My vote is for option #2.
>
>