Posted to dev@parquet.apache.org by Tenghuan He <te...@gmail.com> on 2016/01/23 17:48:20 UTC

parquet-format parquet.thrift struct ColumnMetaData problem

Hi everyone,

I have two questions about the definition of struct ColumnMetaData in
parquet.thrift:

   1. The field "path_in_schema" is a list of strings. Shouldn't there be only
      one path in the schema for a given column? In parquet-hadoop, the
      corresponding class "ColumnChunkMetaData" has the field "ColumnPath
      path", which is not a list.

   2. The field "codec" represents the compression codec of the column. Why
      is it not a list? Must all pages in the same column use the same
      compression codec?

Can anyone explain this?

Below is the definition snippet of ColumnMetaData in parquet.thrift.

struct ColumnMetaData {
  ...
  3: required list<string> path_in_schema

  4: required CompressionCodec codec
  ...
}

Thanks & Best Regards

——————————

Tenghuan He

Re: parquet-format parquet.thrift struct ColumnMetaData problem

Posted by Nong Li <no...@gmail.com>.
Inline.

On Sat, Jan 23, 2016 at 8:48 AM, Tenghuan He <te...@gmail.com> wrote:

> Hi everyone,
>
> In parquet.thrift the definition of struct ColumnMetaData
>
>    1. The field "path_in_schema" is a string list, should not there be only
>       one path in the schema for a specified column? And in parquet-hadoop the
>       corresponding class "ColumnChunkMetaData" there is the field "ColumnPath
>       path", which is not a list.
>
The list holds the components of the path. For example,
struct1.struct2.field1 would be a three-element list. This is typically how
the consumer wants to use the path, and it avoids issues such as how to
escape dots.

Each column has a unique path.
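
To illustrate why a list of components is safer than a single dotted string, here is a minimal Python sketch. It is not code from parquet-format or parquet-hadoop: the helper join_naive and the sample field names are hypothetical, and it assumes a field name may itself contain a dot.

```python
# Hypothetical illustration; none of these names come from Parquet itself.

def join_naive(parts):
    """Join path components with dots, as a single-string path would."""
    return ".".join(parts)

# The example from the reply: one column, three path components.
nested = ["struct1", "struct2", "field1"]

# Two different columns (assumption: a field name may contain a dot):
#   col_x is field "b.c" inside struct "a"
#   col_y is field "c" inside struct "b" inside struct "a"
col_x = ["a", "b.c"]
col_y = ["a", "b", "c"]

# As lists, the two paths stay distinct.
distinct_as_lists = col_x != col_y

# As dotted strings, they collapse into the same text unless dots are escaped.
same_as_strings = join_naive(col_x) == join_naive(col_y)
```

With the list form, no escaping convention is needed, and a consumer can walk the schema one component at a time.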


>    2. The field "codec" which represents the compression codec of the column,
>       why is it not a list? Must all pages in the same column use the same
>       compression codec?
>
> Can anyone explain this?
>
Yes, all pages need the same compression. This would be easy to change (each
page can already have a different encoding), but we'd need good evidence
that it helps in practice. We already don't explore all the ways to use the
encodings, and IMO we should move away from general-purpose compression and
rely on the encodings alone.
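
The relationship described above can be pictured with a small Python model. This is a simplified sketch, not the Thrift-generated classes: the Page and ColumnChunk types and the encodings_used helper are hypothetical, and the codec and encoding values are plain strings chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Simplified, hypothetical model; not the real Parquet metadata classes.

@dataclass
class Page:
    encoding: str  # chosen per page, so it may vary within a chunk

@dataclass
class ColumnChunk:
    path_in_schema: List[str]  # path components, one list per column
    codec: str                 # a single codec shared by every page
    pages: List[Page] = field(default_factory=list)

    def encodings_used(self) -> List[str]:
        """Return the distinct page encodings in sorted order."""
        return sorted({page.encoding for page in self.pages})

chunk = ColumnChunk(
    path_in_schema=["struct1", "struct2", "field1"],
    codec="SNAPPY",
    pages=[
        Page(encoding="PLAIN_DICTIONARY"),
        Page(encoding="PLAIN_DICTIONARY"),
        Page(encoding="PLAIN"),
    ],
)
```

In this model the codec lives on the chunk and the encoding lives on the page, which mirrors the point that encodings can already vary per page while compression cannot.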

