Posted to dev@parquet.apache.org by Mohammad Islam <mi...@yahoo.com.INVALID> on 2016/06/30 17:44:11 UTC
Supporting attribute in Parquet schema
Hi All,
What is the best way to tag a field in the schema with metadata? Does Parquet support it? I think Avro has a "doc" attribute, and Hive schemas have "comments".
I need to tag each field as PII or not. Someone may also want to add a description of a field.
Regards, Mohammad
Re: Supporting attribute in Parquet schema
Posted by Mohammad Islam <mi...@yahoo.com.INVALID>.
Thanks Julien and Nong. It looks like Parquet supports both column-level (per Nong) and file-level (per Julien) metadata, right?
Regards, Mohammad
Re: Supporting attribute in Parquet schema
Posted by Julien Le Dem <ju...@ledem.net>.
You can store arbitrary key/value pairs alongside the schema in the footer:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L565
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file. This schema contains metadata for all the columns.
   * The schema is represented as a tree with a single root. The nodes of the tree
   * are flattened to a list by doing a depth-first traversal.
   * The column metadata contains the path in the schema for that column which can be
   * used to map columns to nodes in the schema.
   * The first element is the root **/
  2: required list<SchemaElement> schema;

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file. This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by
}
You could make the key something like "{some unique name prefix specific to you}.PII.columns" with the value a.b.c,d.e.f (a comma-separated list of column paths).
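A minimal sketch of that key convention, in Python with only the standard library. The prefix "com.example", the helper names encode_pii_columns and decode_pii_columns, and the column paths are assumptions for illustration, not part of Parquet. The resulting pair would be passed to whichever writer API is in use, as one entry of key_value_metadata.

```python
from typing import Dict, List, Tuple

# Hypothetical prefix; pick one that is unique to your organization.
PREFIX = "com.example"


def encode_pii_columns(columns: List[str], prefix: str = PREFIX) -> Tuple[str, str]:
    """Build one footer key/value pair listing the PII column paths."""
    for path in columns:
        # A comma inside a path would make the value ambiguous when split.
        if "," in path:
            raise ValueError(f"column path contains the separator: {path!r}")
    return f"{prefix}.PII.columns", ",".join(columns)


def decode_pii_columns(metadata: Dict[str, str], prefix: str = PREFIX) -> List[str]:
    """Read the PII column paths back from footer key/value metadata."""
    value = metadata.get(f"{prefix}.PII.columns", "")
    return [path for path in value.split(",") if path]


key, value = encode_pii_columns(["a.b.c", "d.e.f"])
footer_metadata = {key: value}  # stands in for FileMetaData.key_value_metadata
print(key, "=", value)
print(decode_pii_columns(footer_metadata))
```

Keeping the whole list under one namespaced key means a reader needs only the footer to learn which columns are tagged, at the cost of keeping the list in sync with the schema by hand.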
Re: Supporting attribute in Parquet schema
Posted by Nong Li <no...@gmail.com>.
Columns have support for key/value pairs in the metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L489
Let me know if that works for you.
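As a rough illustration of the column-level option, the sketch below models per-column key/value metadata with a plain Python dataclass. ColumnChunkMeta is a simplified stand-in, not a real Parquet class, and the "pii" and "doc" keys are hypothetical names. Note that in the format this metadata is attached to each column chunk, so a tag would be repeated in every row group.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnChunkMeta:
    """Simplified stand-in for the per-column-chunk metadata."""
    path_in_schema: List[str]
    key_value_metadata: Dict[str, str] = field(default_factory=dict)


def pii_paths(row_group: List[ColumnChunkMeta], key: str = "pii") -> List[str]:
    """Return dotted paths of the columns whose tag `key` equals "true"."""
    return [
        ".".join(col.path_in_schema)
        for col in row_group
        if col.key_value_metadata.get(key) == "true"
    ]


row_group = [
    ColumnChunkMeta(["a", "b", "c"], {"pii": "true", "doc": "customer email"}),
    ColumnChunkMeta(["d", "e", "f"], {"pii": "false"}),
    ColumnChunkMeta(["g"]),  # untagged column
]
print(pii_paths(row_group))
```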