Posted to dev@orc.apache.org by Gang Wu <ga...@apache.org> on 2018/09/18 20:41:21 UTC

[Discussion] Base 128 variable integer encoding is not always good

Hi,

We are using zstd as the default compressor for ORC in production. Overall
the performance is very good, but our analysis shows there is still some
room for improvement for integer columns.

As we know, all integers use base 128 varint encoding (a.k.a. LEB128) after
RLE. This works well for zlib and other compressors. However, with zstd,
LEB128-encoded data compresses worse than fixed 64-bit int64_t values.
I created an issue in the zstd community and got it confirmed:
https://github.com/facebook/zstd/issues/1325.
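For reference, a minimal sketch of LEB128 (unsigned base-128 varint) encoding. Each byte carries 7 payload bits plus a continuation bit, so the byte length varies with the value's magnitude; this irregular byte layout is what makes the stream harder for zstd's match-finding than fixed-width 8-byte values:

```python
def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def decode_leb128(data: bytes) -> int:
    """Decode a single unsigned LEB128 value from the front of `data`."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    return result

# 127 fits in 1 byte, 128 needs 2, and a full 64-bit value needs up to 10,
# whereas a fixed-width int64 always occupies exactly 8 bytes.
```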

To provide some data: we have an ORC file with 10 columns (4 long columns
and 6 string columns). None of the 4 long columns fit RLE very well, meaning
that most of their values end up as literals in the RLE output. The overall
sizes for the different settings are:

   - RLEv1 + LEB128: 8991617 bytes
   - RLEv2 + LEB128: 8305585 bytes
   - RLEv1 + fixed 64-bit: 7961360 bytes

I also analyzed a single column of the file and got the following result:

   - RLEv1 + zstd + LEB128: 1188651 bytes
   - RLEv1 + zstd + fixed 64-bit: 685522 bytes
   - RLEv1 + zlib + LEB128: 834729 bytes
   - RLEv1 + zlib + fixed 64-bit: 854529 bytes
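A hedged sketch of how this kind of comparison can be reproduced. Python's standard-library zlib stands in for the codec here (the third-party `zstandard` package would be needed to reproduce the zstd numbers), and the data is synthetic rather than the production column:

```python
import struct
import zlib

def leb128(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Synthetic stand-in for a sorted integer column (512 values).
values = [16430 + 4663 * i for i in range(512)]

leb_stream = b"".join(leb128(v) for v in values)
fixed_stream = b"".join(struct.pack("<q", v) for v in values)  # little-endian int64

# Compare raw and compressed sizes of the two integer encodings.
for name, stream in (("LEB128", leb_stream), ("fixed 64-bit", fixed_stream)):
    print(name, len(stream), "->", len(zlib.compress(stream)))
```

Which encoding compresses better depends on the codec and the data, which is exactly the trade-off the measurements above capture.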

From the above observations, we find that it is better to disable LEB128
encoding when zstd is used. This could easily be achieved by bumping the
file version. Any thoughts?

Thanks!
Gang

Re: Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Xiening Dai <xi...@alibaba-inc.com>.
I think the bigger issue here is that the combination of zstd and LEB128 results in a much lower compression ratio compared to zlib. This is by design for zstd level 1, and according to the answer from the zstd community (see the link from Gang), it only gets better at much higher levels (say, 12).
I think fundamentally we should consider the compression codec as a factor when we do the encoding. Certain encoding mechanisms work better with some compression algorithms, and the writer should pick the best encoder based on the compressor the user chooses.
Regarding this specific issue, we could just disable LEB128 when zstd is chosen. But that would require introducing a new type of stream, something like direct_v1_no_leb. We should probably extend the metadata to better represent these differences.


Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Gang Wu <us...@gmail.com>.
Owen,

Yes, you are correct. I misunderstood RLEv2, which does not use LEB128.

To answer your questions:
1. "RLEv1 + fixed 8 byte" in my experiment means that we skip LEB128
encoding for RLE literals and directly write fixed 8 bytes in little-endian
order.
2. The data comes from our production data; it is the primary key of a
table and is naturally sorted.

To summarize my investigation: we have some datasets (which are not good
fits for RLE) indicating that disabling RLE entirely (directly writing
64-bit little-endian integers) is better than either RLEv1 or RLEv2 when
the compressor is zstd. Please see the chart below:
(zstd-compressed sizes in bytes)

dataset | NO RLE  | RLEv1     | RLEv1 + fixed 64bit | RLEv2
1       | 638,920 | 1,188,651 | 685,522             | 533,710
2       | 801,544 | 985,595   | 871,753             | 929,763
3       | 928,168 | 1,271,394 | 1,024,290           | 1,282,509
4       | 856,264 | 987,859   | 895,738             | 1,000,487

I agree that we shouldn't break compression+encoding compatibility in ORC 1,
and this is exactly a good opportunity for RLEv3 in ORC 2 to improve. IMO,
the overhead comes from three factors:
1. Zigzag encoding of the sign. Its overhead is fairly low - about 2%
according to my experiment - and it cannot be discarded, since we need some
way to encode the sign.
2. RLE headers. Comparing the "NO RLE" and "RLEv1 + fixed 64bit" columns in
the chart above gives a rough measure of the overhead of RLE headers after
compression.
3. The largest overhead is LEB128 in RLEv1 and bit-packing in RLEv2 DIRECT
mode, both of which break the byte regularity that zstd relies on.

I will do more investigations and try to use the benchmark tools. Will post
here if I have some new findings.

Thanks,
Gang


Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Owen O'Malley <ow...@gmail.com>.
Thanks for the sample data.

Just out of curiosity, is the natural data actually sorted like that?

I think you have a misunderstanding of RLEv2. It doesn't use LEB128 except
for the values in the header. What does RLEv1 + fixed 8 byte mean?

Based on the 512 values that you posted, I see:

512 values, min = 16430, max = 2403786, minDelta = 6, maxDelta = 78612
order = incr, bitLen = 22, deltaBitLen = 17

so RLEv2 should have used the delta encoding, with 24 bits for each value.
Although with bitLen and deltaBitLen both between 16 and 24 bits, the delta
encoding doesn't help much. Anyway, here is what those 512 sample numbers
would look like in RLEv2:

header: 2 bytes
base: 3 bytes
firstDelta: 3 bytes
rest: 510 * 3 bytes
total: 1538 bytes

Compared to the direct encoding of 512 * 8 bytes = 4096 bytes, the RLEv2
output is at 38% of the direct 8-byte encoding. Even if the data weren't
sorted, it should end up as patched base with a similar size (~3
bytes/value).
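The byte accounting above can be checked mechanically (figures taken from the message; this is just the arithmetic, not an actual RLEv2 encoder):

```python
# RLEv2 delta-run size estimate for 512 sorted values with 3-byte deltas.
header = 2              # run header
base = 3                # base value (24-bit)
first_delta = 3         # first delta (24-bit)
rest = 510 * 3          # remaining 510 deltas at 3 bytes each
rlev2_total = header + base + first_delta + rest

direct_total = 512 * 8  # fixed 64-bit little-endian encoding

print(rlev2_total)                              # 1538
print(round(100 * rlev2_total / direct_total))  # 38 (percent of direct size)
```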

Part of the reason we don't use the odd bit sizes defined for RLEv2 is
precisely that zlib didn't compress non-byte-aligned data well. Have you
tried extending the java/benchmarks with zstd to see what happens with other
data sets? I guess you could add an ORC0 option to the benchmark to compare
RLEv1 to RLEv2 under each of the compression codecs.

This is a great conversation, and I think it will be great input for RLEv3
and the ORC 2 format. In ORC 1, I don't think we should use the compression
codec as a factor to disable or change the RLE, especially based on a single
data set. I'd be more tempted to use zstd with the level set high enough
that it is useful on RLE data, since that doesn't break any old readers.

.. Owen


Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Gang Wu <us...@gmail.com>.
Owen,
  I have put the example data to reproduce the issue in
https://github.com/facebook/zstd/issues/1325. It contains 512 unsigned
numbers which are already zigzag-encoded using (val << 1) ^ (val >> 63). The
low-overhead representation of literals is exactly what we need for RLEv3.
We should also note that zstd does not work well with LEB128, but zlib gets
a better compression ratio with LEB128. There is no one-size-fits-all
solution, and we may end up with several optimal combinations of encoding
and compression settings.
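For reference, a minimal sketch of that zigzag transform, which maps signed 64-bit integers onto unsigned ones so that values of small magnitude (positive or negative) stay numerically small before varint encoding:

```python
MASK64 = (1 << 64) - 1  # keep results in the unsigned 64-bit range

def zigzag_encode(value: int) -> int:
    """(val << 1) ^ (val >> 63) with an arithmetic right shift:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ..."""
    return ((value << 1) ^ (value >> 63)) & MASK64

def zigzag_decode(encoded: int) -> int:
    """Inverse mapping back to a signed integer."""
    return (encoded >> 1) ^ -(encoded & 1)
```

Note that Python's `>>` on a negative int is already an arithmetic (sign-propagating) shift, so `value >> 63` yields -1 for negative inputs, matching the int64_t behavior the formula assumes.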

Gopal
  DIRECT_V2 is RLEv2, which can alleviate the issue but not resolve it. I
will take a look at the orc.encoding.strategy setting.

Thanks!
Gang


Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Owen O'Malley <ow...@gmail.com>.
Gang,
   As you correctly point out, some columns don't work well with RLE.
Unfortunately, without being able to look at the data, it is hard for me to
guess the right compression strategies. Based on your description, I would
guess the data doesn't have many patterns and covers the majority of the
64-bit integer space. I think the best approach would be to make sure that
RLEv3 has a low-overhead representation of literals. So, a literal mode
something like:

header: 2 bytes (literal, 512 values, size 64bit)
data: 512 * 8 bytes

So the overhead would be roughly 2/4096 ≈ 0.0005 (0.05%).
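A rough sketch of what such a literal mode could look like. The 2-byte header layout here is invented purely for illustration (RLEv3 does not exist yet), but it demonstrates the tiny framing overhead:

```python
import struct

LITERAL = 0b11  # hypothetical 2-bit mode tag for "literal run"

def encode_literal_run(values: list[int]) -> bytes:
    """Hypothetical RLEv3 literal run: a 2-byte header followed by raw
    little-endian 64-bit values. Made-up header layout for this sketch:
    2 bits mode | 4 bits width code | 10 bits (count - 1), count <= 512."""
    assert 1 <= len(values) <= 512
    width_code = 7  # 7 => 8-byte (64-bit) values in this sketch
    header = (LITERAL << 14) | (width_code << 10) | (len(values) - 1)
    return struct.pack(">H", header) + b"".join(
        struct.pack("<q", v) for v in values
    )

run = encode_literal_run(list(range(512)))
print(len(run), 2 / (512 * 8))  # 4098 0.00048828125
```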

Thoughts?


Re: [Discussion] Base 128 variable integer encoding is not always good

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

>  From above observation, we find that it is better to disable LEB128 encoding while zstd is used.

You can enable file size optimizations (automatically recommend better layouts for compression) when 

"orc.encoding.strategy"="COMPRESSION"

There are a bunch of bit-packing loops that are already controlled by that flag.

>     https://github.com/facebook/zstd/issues/1325.

If I understand that correctly, would DIRECT_V2 also work fine for those numeric sequences under zstd instead?

Cheers,
Gopal