Posted to dev@drill.apache.org by James Turton <dz...@apache.org> on 2021/09/29 12:27:50 UTC

Parquet compression codecs

Hi all

We've now got support in master for reading and writing Parquet using 
additional compression codecs.  Here are the footprints of a 25M-record 
dataset compressed by Drill with different codecs.

| Codec  | Size on disk (MB) |
| ------ | ----------------- |
| brotli |   87              |
| gzip   |   80              |
| lz4    |  100.6            |
| lzo    |  100.8            |
| snappy |  192              |
| zstd   |   85              |
| none   | 2152              |

I haven't measured the (de)compression speed differences myself, but 
there are many such benchmarks around on the web, and the differences 
can be big *if* you've got a workload that is CPU-bound by 
(de)compression.  Beyond that there are the usual considerations, like 
better utilisation of the OS page cache by the higher-compression-ratio 
codecs (zstd's 85 MB against 2152 MB uncompressed is roughly a 25x 
reduction), less I/O when data must come from disk, etc.  Zstd is 
probably the one I'll be putting into `store.parquet.compression` 
myself at this point.
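
If you'd like to try a codec yourself, here is a minimal sketch of 
setting the option before a CTAS.  The `dfs.tmp` workspace, the table 
name and the source path are just placeholders for your own storage 
configuration.

```sql
-- Choose the codec used for Parquet files written in this session.
ALTER SESSION SET `store.parquet.compression` = 'zstd';

-- Rewrite an existing dataset as zstd-compressed Parquet.
CREATE TABLE dfs.tmp.`lineitem_zstd` AS
SELECT * FROM dfs.`/data/lineitem.parquet`;

-- Restore the default codec afterwards.
ALTER SESSION RESET `store.parquet.compression`;
```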

Happy Drilling!
James

Re: Parquet compression codecs

Posted by James Turton <dz...@apache.org>.
Added to my to-do list.  I'm debugging our Parquet v2 page reader code 
at the moment; after that I'll do a combined post about "Parquet improvements".

On 2021/09/29 16:46, Ted Dunning wrote:
> A blog is a great idea.
>
> I am curious about how much compression costs.
>
>
> On Wed, Sep 29, 2021 at 5:37 AM luoc <lu...@apache.org> wrote:
>
>> James, you are doing fine.
>> Is it possible to post a new blog on the website for this?


Re: Parquet compression codecs

Posted by Ted Dunning <te...@gmail.com>.
A blog is a great idea.

I am curious about how much compression costs.


On Wed, Sep 29, 2021 at 5:37 AM luoc <lu...@apache.org> wrote:

>
> James, you are doing fine.
> Is it possible to post a new blog on the website for this?

Re: Parquet compression codecs

Posted by luoc <lu...@apache.org>.
James, you are doing fine.
Is it possible to post a new blog on the website for this?
