Posted to dev@parquet.apache.org by Mohammad Islam <mi...@yahoo.com.INVALID> on 2016/06/30 17:44:11 UTC

Supporting attribute in Parquet schema

Hi All,
What is the best way to tag a field in the schema with metadata? Does Parquet support it? I think Avro has a "doc" attribute, and Hive schemas have "comments".
I need to tag each field as PII or not. Someone may also want to add a description of a field.
Regards,
Mohammad


Re: Supporting attribute in Parquet schema

Posted by Mohammad Islam <mi...@yahoo.com.INVALID>.
Thanks Julien and Nong. Looks like Parquet supports both column-level (re: Nong) and file-level (re: Julien) metadata, right?
Regards,
Mohammad
 


Re: Supporting attribute in Parquet schema

Posted by Julien Le Dem <ju...@ledem.net>.
You can store arbitrary key values alongside the schema in the footer:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L565
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file.  This schema contains metadata for all the columns.
   * The schema is represented as a tree with a single root.  The nodes of the tree
   * are flattened to a list by doing a depth-first traversal.
   * The column metadata contains the path in the schema for that column which can be
   * used to map columns to nodes in the schema.
   * The first element is the root **/
  2: required list<SchemaElement> schema;

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file.  This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by
}

You could make the key something like "{some unique name prefix specific to you}.PII.columns"=a.b.c,d.e.f
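To make the key/value convention above concrete, here is a minimal sketch in plain Python. It only shows how the comma-separated value could be built and read back; it does not write a Parquet file. The prefix "com.example" and the helper names are made up for illustration, and the footer's key_value_metadata list is modeled as a simple dict.

```python
# Hypothetical key: the suggestion above only asks for a prefix unique to you.
PII_KEY = "com.example.PII.columns"


def encode_pii_columns(column_paths):
    """Join dotted column paths into one comma-separated metadata value."""
    for path in column_paths:
        # A comma inside a path would make the value ambiguous on read.
        if "," in path:
            raise ValueError(f"column path contains a comma: {path!r}")
    return ",".join(column_paths)


def decode_pii_columns(key_value_metadata):
    """Return the set of PII column paths stored under PII_KEY, if any."""
    value = key_value_metadata.get(PII_KEY, "")
    return {path for path in value.split(",") if path}


def is_pii(key_value_metadata, column_path):
    """True if the exact dotted path is listed as PII."""
    return column_path in decode_pii_columns(key_value_metadata)


# Stand-in for the footer's key_value_metadata, modeled as a dict.
footer_metadata = {PII_KEY: encode_pii_columns(["a.b.c", "d.e.f"])}
```

With this convention a reader only needs the footer to decide which columns to mask. Getting the pair into the footer is left to whichever Parquet library writes the file.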




Re: Supporting attribute in Parquet schema

Posted by Nong Li <no...@gmail.com>.
Columns have support for key/value pairs in the metadata:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L489

Let me know if that works for you.
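As a rough illustration of the column-level option, the sketch below models the two relevant fields of ColumnMetaData (path_in_schema and key_value_metadata) as plain Python dataclasses. These classes are stand-ins, not a real Parquet API, and the "pii" and "doc" keys and the column names are hypothetical. Whether a given writer library lets you set this field is something to check separately.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KeyValue:
    """Stand-in for the Thrift KeyValue struct (key required, value optional)."""
    key: str
    value: Optional[str] = None


@dataclass
class ColumnMetaData:
    """Stand-in carrying only the two fields this example needs."""
    path_in_schema: List[str]
    key_value_metadata: List[KeyValue] = field(default_factory=list)


def column_tag(column, key):
    """Return the value stored under key for this column, or None if absent."""
    for kv in column.key_value_metadata:
        if kv.key == key:
            return kv.value
    return None


# Hypothetical columns: one tagged as PII with a description, one untagged.
columns = [
    ColumnMetaData(
        ["user", "email"],
        [KeyValue("pii", "true"), KeyValue("doc", "Contact email address")],
    ),
    ColumnMetaData(["user", "signup_ts"]),
]

# Dotted paths of every column whose "pii" tag is "true".
pii_paths = [
    ".".join(c.path_in_schema) for c in columns if column_tag(c, "pii") == "true"
]
```

One thing to keep in mind: ColumnMetaData is stored per column chunk inside each row group, so tags placed there would be repeated in every row group, while the footer-level key/value list is stored once per file.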
