Posted to dev@ignite.apache.org by Ilya Kasnacheev <il...@gmail.com> on 2018/09/03 17:36:29 UTC

Re: Compression prototype

Hello again!

I've been running various compression parameters through the cod dataset.

It looks like the best compression level in terms of speed is either 1 or 2.
The default for Zstd seems to be 3, which almost always performs worse.
For best performance a dictionary of 1024 bytes is optimal; for better
compression one might choose larger dictionaries. 6k looks good, but I will
also run a few benchmarks on larger dicts. Unfortunately, Zstd crashes if the
sample size is set to more than 16k entries (I should probe the maximum
buffer size at which problems begin).

I'm attaching two charts which show what we've got. Compression rate is the
compressed size as a fraction of the original record size. Time to run is the
wall clock time of the test run. Reasonable compression will roughly double
the run time of a program that only parses text records -> creates objects ->
marshals them to binary -> compresses -> decompresses. Notation: s{number of
binary objects used to train}-d{dictionary length in bytes}-l{compression level}.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png> 
The second one is basically a zoom-in on the first.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png> 
I think that in addition to dictionary compression we should have
dictionary-less compression. On typical data of small records it shows a
compression rate of 0.8 ~ 0.65, but I can imagine that with larger
unstructured records it can be as good as dict-based, while being much less
of a hassle since there is no dictionary to manage. WDYT?
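
The dictionary-less path could be as simple as this (a minimal sketch using
zstd-jni's one-shot helpers; note the original length has to be kept next to
the compressed bytes so decompression can size its output buffer):

import com.github.luben.zstd.Zstd;

public class PlainCompressionSketch {
    static final int LEVEL = 2; // levels 1-2 were the fastest here as well

    static byte[] compress(byte[] marshalled) {
        return Zstd.compress(marshalled, LEVEL);
    }

    // Zstd needs the original size to allocate the output buffer, so it has
    // to be stored next to the compressed bytes.
    static byte[] decompress(byte[] compressed, int originalLength) {
        return Zstd.decompress(compressed, originalLength);
    }
}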
Sorry for the fine print. I hope my charts will be visible.

You can see the updated code as pull request:
https://github.com/apache/ignite/pull/4673

Regards,




Re: Compression prototype

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Of course, this setting will be configurable.

Regards,
-- 
Ilya Kasnacheev


On Wed, Sep 5, 2018 at 3:21, Dmitriy Setrakyan <ds...@apache.org> wrote:

> In my view, dictionary of 1024 bytes is not going to be nearly enough.

Re: Compression prototype

Posted by Dmitriy Setrakyan <ds...@apache.org>.
In my view, dictionary of 1024 bytes is not going to be nearly enough.


Re: Compression prototype

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

In the case of Apache Ignite, most of the savings are due to the BinaryObject
format, which encodes types and fields with byte sequences. Any enum/string
flags will also get into the dictionary. And then, as it processes a record,
the compressor fills up its individual dictionary.

But within one cache, most if not all entries have an identical BinaryObject
layout, so a tiny dictionary covers that case. Compression algorithms are
not very keen on large dictionaries, preferring to work with local
regularities in the byte stream.

E.g. if we have large entries in a cache with low BinaryObject overhead,
they're served just fine by "generic" compression.

All of the above is speculation, actually. I just observe that on a large
data set the compression ratio is around 0.4 (2.5x) with a dictionary of
1024 bytes. The rest is a black box.
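
To be clear about the metric: the ratio is total compressed size over total
original size, so 0.4 means a 1/0.4 = 2.5x size reduction. It is measured
roughly like this (a sketch, not code from the PR):

import com.github.luben.zstd.ZstdCompressCtx;
import com.github.luben.zstd.ZstdDictCompress;
import java.util.List;

public class RatioSketch {
    static double ratio(List<byte[]> records, byte[] dict, int level) {
        long original = 0, compressed = 0;
        try (ZstdCompressCtx ctx = new ZstdCompressCtx()) {
            ctx.setLevel(level);
            ctx.loadDict(new ZstdDictCompress(dict, level));
            for (byte[] rec : records) {
                original += rec.length;
                compressed += ctx.compress(rec).length;
            }
        }
        return (double) compressed / original; // ~0.4 on this data set
    }
}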

Regards,
-- 
Ilya Kasnacheev



Re: Compression prototype

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Thanks, Ilya, understood. I think per-cache is a better idea. However, I
have a question about dictionary size. Ignite stores TBs of data. How do
you expect the dictionary to fit in 1K bytes?

D.

Re: Compression prototype

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Each node has a local dictionary (per node currently, per cache planned).
The dictionary is never shared between nodes. As data patterns shift,
dictionary rotation is also planned.

With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it is
enough to store the common BinaryObject boilerplate, and everything else is
compressed on the fly. The source sample is 16k records.
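
The per-cache variant could be little more than a registry like this (purely
an illustrative sketch; none of these names exist in the PR):

import com.github.luben.zstd.ZstdDictCompress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DictRegistrySketch {
    // One compiled dictionary per cache, swapped out as data patterns shift.
    private final ConcurrentMap<String, ZstdDictCompress> dicts =
        new ConcurrentHashMap<>();

    void rotate(String cacheName, byte[] trainedDict, int level) {
        // Note: entries compressed with an older dictionary still need it for
        // decompression, so real rotation implies versioning dictionaries.
        dicts.put(cacheName, new ZstdDictCompress(trainedDict, level));
    }

    ZstdDictCompress dictFor(String cacheName) {
        return dicts.get(cacheName);
    }
}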

Regards,
-- 
Ilya Kasnacheev



Re: Compression prototype

Posted by Dmitriy Setrakyan <ds...@apache.org>.

I was under a different impression. If the dictionary is for the whole data
set, then it will occupy megabytes (if not gigabytes) of data. What happens
when a new node joins and has no idea about the dictionary? What happens
when dictionaries on different nodes get out of sync?

D.

Re: Compression prototype

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

The compression is per-binary-object, but the dictionary is external, shared
between multiple (millions of) entries, and stored alongside the compressed data.
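
"Stored alongside" could mean a layout like this (hypothetical, for
illustration only; not the PR's actual format): each compressed entry carries
the id of the dictionary it was compressed with plus the original length,
while the dictionary bytes themselves are stored once per cache.

import java.nio.ByteBuffer;

public class EntryLayoutSketch {
    // entry := dictId (4 bytes) | originalLen (4 bytes) | zstd-compressed bytes
    static byte[] wrap(int dictId, int originalLen, byte[] compressed) {
        return ByteBuffer.allocate(8 + compressed.length)
            .putInt(dictId)       // which shared dictionary to decompress with
            .putInt(originalLen)  // sizes the decompression buffer later
            .put(compressed)
            .array();
    }
}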

Regards,
-- 
Ilya Kasnacheev



Re: Compression prototype

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Hi Ilya,

This is very useful. Is the compression going to be per-page, in which case
the dictionary is going to be kept inside of a page? Or do you have some
other design in mind?

D.
