Posted to dev@parquet.apache.org by 黄权隆 <hu...@gmail.com> on 2017/09/22 09:06:23 UTC

Is it possible or reasonable for a Parquet file to have multiple compression types?

Hi all,

I asked this question on StackOverflow, but this list seems to be the right
place to ask. Link:
https://stackoverflow.com/questions/46312522/is-it-possible-or-reasonable-for-a-parquet-file-to-have-multiple-compression-typ

After reading the thrift definition of metadata in Parquet (
https://github.com/apache/parquet-format#metadata), I found that there's a
CompressionCodec field in each ColumnMetaData.

Is it possible or reasonable for a Parquet file to have different
compression types for different columns?

If so, what's the scenario? How can I generate such a file for testing?
ParquetWriter only accepts a single compression type in its constructor.

If not, why not move the CompressionCodec field into the FileMetaData? For
example, ORC uses a uniform compression for all columns and writes its
compression type in Postscript (https://orc.apache.org/docs/file-tail.html).


Thanks
Quanlong

Re: Is it possible or reasonable for a Parquet file to have multiple compression types?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Columns could have different compression codecs, but I don't think
implementations currently support writing with more than one codec. Readers
should support it. I think that columns are the right level for choosing a
codec, and there's very little overhead for this enum, so it doesn't cost
very much to put it there instead of in the footer.

rb


-- 
Ryan Blue
Software Engineer
Netflix
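
To illustrate the point that readers should handle a different codec per column, here is a minimal sketch of codec dispatch driven by per-column metadata. It is a simplified model, not the parquet-mr or parquet-format API: the ColumnChunk class and the write_chunk/read_chunk helpers are hypothetical names, each chunk is compressed as a whole (real Parquet compresses individual pages), and only the two codecs available in the Python standard library are covered. The enum values follow the CompressionCodec enum in parquet.thrift.

```python
import gzip
from dataclasses import dataclass
from enum import IntEnum


class CompressionCodec(IntEnum):
    # Subset of the CompressionCodec enum from parquet.thrift.
    UNCOMPRESSED = 0
    GZIP = 2


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for a column chunk together with the
    # codec recorded in its ColumnMetaData.
    path: str
    codec: CompressionCodec
    data: bytes  # bytes as they would be stored in the file


def write_chunk(path: str, codec: CompressionCodec, raw: bytes) -> ColumnChunk:
    # The writer picks a codec per column and records it next to the data.
    if codec is CompressionCodec.GZIP:
        stored = gzip.compress(raw)
    elif codec is CompressionCodec.UNCOMPRESSED:
        stored = raw
    else:
        raise ValueError(f"unsupported codec: {codec!r}")
    return ColumnChunk(path=path, codec=codec, data=stored)


def read_chunk(chunk: ColumnChunk) -> bytes:
    # The reader never assumes a file-wide codec; it looks at the
    # codec stored with each column chunk.
    if chunk.codec is CompressionCodec.GZIP:
        return gzip.decompress(chunk.data)
    if chunk.codec is CompressionCodec.UNCOMPRESSED:
        return chunk.data
    raise ValueError(f"unsupported codec: {chunk.codec!r}")


if __name__ == "__main__":
    # One "row group" whose two columns use different codecs.
    row_group = [
        write_chunk("id", CompressionCodec.UNCOMPRESSED, bytes(range(8))),
        write_chunk("comment", CompressionCodec.GZIP, b"lorem ipsum " * 50),
    ]
    for chunk in row_group:
        raw = read_chunk(chunk)
        print(chunk.path, chunk.codec.name, len(chunk.data), "->", len(raw))
```

Because the codec travels with each column chunk, a reader written this way works unchanged whether a file uses one codec throughout or a different one per column.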