Posted to dev@drill.apache.org by James Turton <dz...@apache.org> on 2021/09/29 12:27:50 UTC
Parquet compression codecs
Hi all
We've now got support in master for reading and writing Parquet with
additional compression codecs. Here are the on-disk footprints of a
25M-record dataset compressed by Drill with the different codecs.
| Codec | Size on disk (MB) |
| ------ | ----------------- |
| brotli | 87 |
| gzip | 80 |
| lz4 | 100.6 |
| lzo | 100.8 |
| snappy | 192 |
| zstd | 85 |
| none | 2152 |
I haven't measured the (de)compression speed differences myself, but
there are plenty of such benchmarks on the web, and the differences can
be big *if* your workload is CPU bound by (de)compression. Beyond that
there are the usual considerations: better utilisation of the OS page
cache with the higher compression ratio codecs, less I/O when data must
come from disk, etc. Zstd is probably the one I'll be putting into
`store.parquet.compression` myself at this point.
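For anyone who wants to try one of the new codecs, here's a minimal sketch of switching the writer over from SQL. The option name is the real `store.parquet.compression` mentioned above, but the table names below are only illustrative.

    -- Set the Parquet write codec for the current session only
    ALTER SESSION SET `store.parquet.compression` = 'zstd';

    -- Rewrite an existing table so the new codec applies
    -- (dfs.tmp.`my_table` is a hypothetical example table)
    CREATE TABLE dfs.tmp.`my_table_zstd` AS SELECT * FROM dfs.tmp.`my_table`;

    -- Put the option back to its default afterwards
    ALTER SESSION RESET `store.parquet.compression`;

Using ALTER SYSTEM instead of ALTER SESSION would make the codec the default for all sessions.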
Happy Drilling!
James
Re: Parquet compression codecs
Posted by James Turton <dz...@apache.org>.
Added to my to-do list. I'm debugging our Parquet v2 page reader code
at the moment; after that I'll do a combined post about "Parquet improvements".
On 2021/09/29 16:46, Ted Dunning wrote:
> A blog is a great idea.
>
> I am curious about how much compression costs.
>
>
> On Wed, Sep 29, 2021 at 5:37 AM luoc <lu...@apache.org> wrote:
>
>> James, you are doing fine.
>> Is it possible to post a new blog on the website for this?
>>
Re: Parquet compression codecs
Posted by Ted Dunning <te...@gmail.com>.
A blog is a great idea.
I am curious about how much compression costs.
On Wed, Sep 29, 2021 at 5:37 AM luoc <lu...@apache.org> wrote:
>
> James, you are doing fine.
> Is it possible to post a new blog on the website for this?
>
Re: Parquet compression codecs
Posted by luoc <lu...@apache.org>.
James, you are doing fine.
Is it possible to post a new blog on the website for this?