
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/12/11 17:32:55 UTC

[GitHub] [parquet-format] gszadovszky opened a new pull request #164: PARQUET-1950: Define core features

gszadovszky opened a new pull request #164:
URL: https://github.com/apache/parquet-format/pull/164


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [PARQUET-1950](https://issues.apache.org/jira/browse/PARQUET-1950) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain Javadoc that explain what it does
   
   -----
   
   The whole document is up for discussion, but the parts marked with a **?** or **TODO** are the ones where I don't have a firm opinion. Feel free to comment on content or wording.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593032566



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       Yeah, I guessed so. I am not very good at it, and that's why I tried to write something here instead of in the document in the first place.
   
   I would like to formalize the following idea. A writer writes a logical type if it has a data type with similar semantics. It is allowed not to write some logical types if there are no similar values to write (kind of common sense). A reader has to recognize all the listed logical types and either convert them (or read them as is) to one of its own types or, if no such internal type exists, it may decline to read such values but has to inform the user. We do not want misinterpreted or silently skipped values.
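
The reader side of this idea could be sketched as follows. This is only an illustration under assumptions: the type names come from the draft list in this PR (several are still marked with **?**), while `read_value`, `UnreadableLogicalType` and the `converters` mapping are hypothetical names that do not exist in any Parquet implementation.

```python
from typing import Callable, Dict

# Logical types from the draft list in this PR (the "?" entries are still open).
LISTED_LOGICAL_TYPES = frozenset({
    "STRING", "MAP", "LIST", "ENUM", "DECIMAL", "DATE", "TIME",
    "TIMESTAMP", "INTEGER", "UNKNOWN", "JSON", "BSON", "UUID",
})


class UnreadableLogicalType(Exception):
    """Raised so that the user is informed instead of values being skipped."""


def read_value(logical_type: str, raw: bytes,
               converters: Dict[str, Callable[[bytes], object]]) -> object:
    """Convert a raw value or fail loudly; never skip or reinterpret silently."""
    if logical_type not in LISTED_LOGICAL_TYPES:
        raise UnreadableLogicalType(f"unrecognized logical type: {logical_type}")
    convert = converters.get(logical_type)
    if convert is None:
        # Recognized, but this (hypothetical) reader has no matching internal type.
        raise UnreadableLogicalType(
            f"{logical_type} is recognized but has no internal representation")
    return convert(raw)


# A hypothetical reader that can represent strings but not BSON documents.
converters = {"STRING": lambda raw: raw.decode("utf-8")}
print(read_value("STRING", b"parquet", converters))  # prints: parquet
```

Calling `read_value("BSON", ...)` with the same `converters` raises `UnreadableLogicalType`, which is the "inform the user" branch; there is deliberately no branch that returns a default or drops the value.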





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542278289



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       I am fine with removing LZO because of the licensing. Also, I am not sure it would have significant benefits compared to the others (LZ4 or ZSTD).





[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827

@gszadovszky and @emkornfield it's highly coincidental that I was just looking into cleaning up apache/arrow#8130 when I noticed this thread.
External column chunks support is one of the key features that attracted me to parquet in the first place and I would like the chance to lobby for keeping it and actually expanding its adoption - I already have the complete PR mentioned above and I can help with supporting it across other implementations.
There are a few major domains where I see this as a valuable component:
1. Allowing concurrent reads of fully flushed row groups while the parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a parquet file without impacting potential readers.
2. Being able to aggregate multiple data sets in a master parquet file: one scenario is cumulative recordings like stock prices that are collected daily and need to be presented as one unified historical file; another is the case of enrichment, where we want to add new columns to an existing data set.
3. Allowing for bi-temporal changes to a parquet file: external column chunks allow one to apply small corrections by creating delta files and new footers that swap out the chunks that require changes and point to the new ones.

If the above use cases are addressed by other parquet overlays, or they don't line up with the intended usage of parquet, I can look elsewhere, but it seems like a huge opportunity, and the development cost of supporting it is quite minor by comparison.
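
For readers unfamiliar with the mechanism: in `parquet.thrift`, `ColumnChunk` has an optional `file_path` that, when set, names the file holding the chunk data, relative to the file containing the metadata. A minimal sketch of how a reader might resolve it, where `ColumnChunkRef` and `resolve_chunk_file` are hypothetical stand-ins for illustration and not taken from the PR mentioned above:

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunkRef:
    """Hypothetical stand-in for the two thrift ColumnChunk fields used here."""
    file_offset: int
    file_path: Optional[str] = None  # unset means "same file as the footer"


def resolve_chunk_file(footer_path: str, chunk: ColumnChunkRef) -> str:
    """Return the path of the file that actually stores the chunk's pages."""
    if chunk.file_path is None:
        return footer_path
    # file_path is interpreted relative to the file that holds the footer.
    base_dir = posixpath.dirname(footer_path)
    return posixpath.normpath(posixpath.join(base_dir, chunk.file_path))


# A master footer that keeps one chunk locally and points at a daily file.
local = ColumnChunkRef(file_offset=4)
external = ColumnChunkRef(file_offset=4, file_path="2021-02-01/prices.parquet")
print(resolve_chunk_file("data/master.parquet", local))     # data/master.parquet
print(resolve_chunk_file("data/master.parquet", external))  # data/2021-02-01/prices.parquet
```

Under this reading, the aggregation and correction scenarios above amount to writing a new footer whose chunk references point at different files, without rewriting the referenced data.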


[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569240337



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer. 
    */ 
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.

Review comment:
       It's a good point, @jorisvandenbossche. However, I am not sure what the other version is for. Maybe it would be better to specify that one correctly, and then it would be clear that this one is a different thing.
   Currently parquet-mr always writes `1` to that version field. I guess this is a file format version that we never incremented because it is backward compatible with the first files we wrote. If that is the case, then I think we will never increment it.
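
To make the difference from the existing `version` field concrete, the rule quoted from CoreFeatures.md in this PR (a reader at `a.b.x` must be able to read files written at `a.d.y` where `b >= d`) can be sketched as a check. This assumes the compliance level is stored as a plain `major.minor.patch` string, which the hunk above does not confirm, and `parse_release` and `reader_can_read` are hypothetical helper names.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers (assumed format)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reader_can_read(reader_level: str, file_level: str) -> bool:
    """True when the draft rule guarantees the reader can read the file."""
    r_major, r_minor, _ = parse_release(reader_level)
    f_major, f_minor, _ = parse_release(file_level)
    # Same major version, and the reader's minor is at least the file's minor.
    # The patch component does not take part in the rule.
    return r_major == f_major and r_minor >= f_minor


print(reader_can_read("2.9.0", "2.8.1"))  # True
print(reader_can_read("2.8.0", "2.9.0"))  # False
print(reader_can_read("3.0.0", "2.9.0"))  # False
```

The last case returns `False` only because the draft makes no statement across major versions; the sketch treats "not guaranteed" as "no".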





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r543277575



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       parquet-cpp doesn't support LZO, FTR. Also, I'm not sure LZO is interesting nowadays, compared to LZ4 and ZSTD.





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542437689



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       Well, if some pages don't compress well you can skip the CPU cost of compression for them (I suppose that was the rationale for adding this feature?).
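
A rough sketch of that trade-off, assuming the feature in question is a per-page "is compressed" flag stored next to the page body. `encode_page` and the `min_saving` threshold are hypothetical, and `zlib` merely stands in for whichever codec the column uses.

```python
import os
import zlib
from typing import Tuple


def encode_page(page: bytes, min_saving: float = 0.1) -> Tuple[bytes, bool]:
    """Return (body, is_compressed); keep the raw bytes if compression saves little."""
    compressed = zlib.compress(page)
    if len(compressed) <= len(page) * (1.0 - min_saving):
        return compressed, True
    # Not worth it: store the page as is so readers can skip decompression.
    return page, False


repetitive = b"parquet" * 200       # compresses very well
noisy = os.urandom(1400)            # random bytes rarely compress at all
for name, page in (("repetitive", repetitive), ("noisy", noisy)):
    body, is_compressed = encode_page(page)
    print(name, len(page), len(body), is_compressed)
```

Note that this sketch still pays the compression cost once on the write path and only spares readers the decompression; a writer that wants to avoid the attempt altogether would have to decide from a sample or from statistics instead.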





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542431137



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       Based on parquet.thrift, V2 only lets a writer choose whether a page is compressed or not. The compression codec is specified in `ColumnMetaData`, so it can be set per column (not per page) for both V1 and V2.
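
The split described in this comment can be sketched as follows. This is a minimal illustration, not parquet-mr or parquet-cpp code: `ColumnMeta`, `PageV2`, `write_page` and `read_page` are hypothetical names, and `zlib` only stands in for a real Parquet codec. The codec is fixed once per column (as in `ColumnMetaData.codec`), while each V2 page carries only a flag (as in `DataPageHeaderV2.is_compressed`) saying whether that codec was applied.

```python
import zlib
from dataclasses import dataclass


@dataclass
class ColumnMeta:
    # Stand-in for ColumnMetaData.codec: one codec for the whole column chunk.
    codec: str


@dataclass
class PageV2:
    payload: bytes
    # Stand-in for DataPageHeaderV2.is_compressed: a per-page on/off switch.
    is_compressed: bool = True


def write_page(raw: bytes, column: ColumnMeta) -> PageV2:
    """Compress with the column's codec unless that does not shrink the page."""
    if column.codec == "UNCOMPRESSED":
        return PageV2(raw, is_compressed=False)
    packed = zlib.compress(raw)  # assumption: zlib stands in for the column codec
    if len(packed) >= len(raw):
        # Page does not compress well: store it raw and clear the flag.
        return PageV2(raw, is_compressed=False)
    return PageV2(packed, is_compressed=True)


def read_page(page: PageV2, column: ColumnMeta) -> bytes:
    """Decompress only when the page says the column codec was applied."""
    if page.is_compressed and column.codec != "UNCOMPRESSED":
        return zlib.decompress(page.payload)
    return page.payload
```

A page can opt out of compression, but it cannot pick a codec different from the one recorded for its column.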





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r590617889



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       I've proposed a new `LZ4_RAW` codec in #168, where I also deprecate the current `LZ4` codec.





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541131301



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       From a C++ perspective, I believe the licensing of LZO libraries is problematic (we had an old JIRA to implement it, but I think it was closed for this reason).





[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r543638383



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       Yeah LZO doesn't stack up that well - https://facebook.github.io/zstd/. I think you can make a case for zstd+lz4 or zstd or none being part of core.
   
   LZO does have some advantages for text files, because its container format and index-file feature allowed splitting, but that's completely irrelevant for Parquet.





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r592991912



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       So, what would be the general approach here? Do we want to add V2 to the first release of core features, so that we expect all implementations to support it, or do we leave it for a later release?
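
For context on what deferring a feature to a later release would mean for readers, the compatibility rule from the quoted "Versioning" section can be sketched as below. This is only an illustration under assumptions: `parse_version` and `reader_must_support` are hypothetical helpers, and the version strings are assumed to be plain `major.minor.patch` values such as the draft suggests writing into `compliance_level`.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reader_must_support(reader_level: str, file_level: str) -> bool:
    """Rule from the draft: a reader at a.c.x must read files written at a.b.y when c >= b."""
    reader_major, reader_minor, _ = parse_version(reader_level)
    file_major, file_minor, _ = parse_version(file_level)
    # The patch component does not affect the core feature list.
    return reader_major == file_major and reader_minor >= file_minor
```

Under this rule, a feature added to the core list in a later minor release only binds readers that claim that release or a newer one within the same major version.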





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r699241790



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values

Review comment:
       Parquet C++ should now support DELTA_BINARY_PACKED for reading.
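
For reference, the versioning rule in the quoted draft (a reader at core-feature level `a.c.x` must read files written at level `a.b.y` where `c >= b`) can be sketched as a small check. This is only an illustration; the function names and the plain `major.minor.patch` string format are assumptions, not something the PR defines.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reader_must_read(reader_level: str, writer_level: str) -> bool:
    """Return True if the draft rule guarantees readability.

    The guarantee holds only within the same major version and only
    when the reader's minor version is at least the writer's.
    The patch component does not affect the result.
    """
    r_major, r_minor, _ = parse_version(reader_level)
    w_major, w_minor, _ = parse_version(writer_level)
    return r_major == w_major and r_minor >= w_minor
```

A `False` result does not mean the file is unreadable, only that the rule gives no guarantee.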




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542301349



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two

Review comment:
       I think it is more of an interoperability test than an integration one. Anyway, it is a good idea to add this requirement here.
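
A rough sketch of what such an interoperability check could look like: every implementation writes the same rows, and every other implementation must read them back unchanged. The `(write, read)` callable pairs are hypothetical stand-ins for real Parquet implementations, not an existing test harness.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Writer = Callable[[Rows], bytes]
Reader = Callable[[bytes], Rows]


def interop_failures(
    impls: Dict[str, Tuple[Writer, Reader]], rows: Rows
) -> List[Tuple[str, str]]:
    """Return the (writer, reader) pairs that fail to round-trip `rows`."""
    failures = []
    # permutations(..., 2) yields every ordered pair of distinct
    # implementations, so both directions are tested.
    for writer_name, reader_name in permutations(impls, 2):
        write, _ = impls[writer_name]
        _, read = impls[reader_name]
        try:
            ok = read(write(rows)) == rows
        except Exception:
            # A reader that raises on another writer's output is a failure too.
            ok = False
        if not ok:
            failures.append((writer_name, reader_name))
    return failures
```

An empty result means each implementation can read the data written by the other one and vice versa.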





[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771141210


   > Because neither external column chunks nor summary files spread across the different implementations (because of the lack of specification), I think we should not include the usage of the field file_path in this document, or we should even explicitly specify that this field is not supported.
   
   Being explicit seems reasonable to me if others are OK with it.



[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770306448


   It would be nice to cover the status of external file column chunk (https://github.com/apache/arrow/pull/8130 was opened for C++)



[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r576805491



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       As an update, while we (parquet-cpp) managed to fix the aforementioned compatibility issue, another compatibility issue appeared in the other direction (LZ4-compressed files produced by parquet-cpp cannot be read back by parquet-java): https://issues.apache.org/jira/browse/PARQUET-1974
   
   At this point, it seems the lack of specification around LZ4 compression makes it basically impossible to implement in a compatible manner. Therefore it shouldn't be in this spec.
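
To make the consequence concrete, here is a minimal sketch of how a writer could flag features that fall outside a core list. The `EXAMPLE_CORE_CODECS` set is purely illustrative: it holds only the entries that carry no **(?)** marker in the quoted draft, and the final list is still under discussion. The function name is hypothetical as well.

```python
from typing import Iterable, Set

# Illustrative only: the codecs without a "(?)" marker in the quoted draft.
# This is NOT an agreed core feature list.
EXAMPLE_CORE_CODECS = frozenset({"UNCOMPRESSED", "SNAPPY", "GZIP"})


def non_core_features(
    used: Iterable[str], core: Iterable[str] = EXAMPLE_CORE_CODECS
) -> Set[str]:
    """Return the features a file uses that are not on the core list."""
    return set(used) - set(core)
```

With LZ4 left off the list, a writer that uses it would get `{"LZ4"}` back and could not claim the compliance level for that file.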





[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771465002


   @ggershinsky, I think it is totally fine to say that encryption does not support external chunks and similar features even if they were fully supported by the implementations.
   
   BTW, since we are already talking about encryption: I did not plan to include this feature here for now. I think it is not mature enough yet, and it is not something that every implementation requires. It might be a good candidate for a later release of core features.



[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542286601



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       The big question here is whether the other implementations can guarantee reading and writing these compression codecs. For example, parquet-mr uses Hadoop for some of them, so whether they are supported depends on the environment. If we put these into the core features, parquet-mr cannot rely on the environment to provide the related codecs. (According to https://github.com/airlift/aircompressor it does not seem to be a big deal; it just requires some effort and testing.)
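
The environment dependency can be illustrated with a small Python sketch that only reports a codec as usable when its backing module can be found at runtime. The mapping from Parquet codec names to Python module names is an assumption made for this example (it is not how parquet-mr resolves codecs), and `LZO` is left out because no module name is assumed for it.

```python
from importlib.util import find_spec
from typing import Dict, Optional

# Assumed codec-to-module mapping, for illustration only.
# None means no external module is needed.
CODEC_MODULES: Dict[str, Optional[str]] = {
    "UNCOMPRESSED": None,
    "GZIP": "zlib",
    "SNAPPY": "snappy",
    "BROTLI": "brotli",
    "LZ4": "lz4",
    "ZSTD": "zstandard",
}


def available_codecs(
    modules: Dict[str, Optional[str]] = CODEC_MODULES,
) -> Dict[str, bool]:
    """Report, per codec, whether its backing module is importable here."""
    return {
        codec: module is None or find_spec(module) is not None
        for codec, module in modules.items()
    }
```

An implementation that claims a compliance level would need every core codec to come back `True` in every supported environment, which is the guarantee in question.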





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542300290



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and

Review comment:
       Good point, I'll do that.





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542443059



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that, we require at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       I guess so. It sounds reasonable. I hope someone with deeper knowledge of these will join the review.
   
   If we think the V2 data page header is worth it, I am not against adding it to the list for the first or maybe a later release. But first, we should list all the pros and cons.





[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541122935



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release defines a compliance level that
+the different implementations can be tied to. If an implementation claims that
+it provides the functionality of the core features of a parquet-format release,
+it must implement all of the listed features according to the specification
+(both read and write path). This way it is easier to ensure compatibility
+between the different parquet implementations.
+We cannot and do not want to stop our clients from using features that are not
+on this list, but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and in one of its
+implementations) but not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases, which follow the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning, if one implementation supports the core
+features of the parquet-format release `a.b.x`, it must be able to read any
+parquet file written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document,
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that, we require at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported:
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical types](LogicalTypes.md) are practically annotations that help to
+interpret the related primitive type (or structure). Originally we had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it was hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow the use of
+the old `ConvertedType` values according to the specified rules, but we expect
+that the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY when writing V1 pages. I don't know the reason
+behind this. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       I would love to use LZ4/ZSTD more because the performance/density benefits can be significant; not sure which implementations have support yet, though (Impala has both as of the 3.3 release).

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       Licensing of LZO may be an issue for some (main implementation is GPL, not Apache-compatible).





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542302697



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY when writing V1 pages. I don't know the reason
+behind this. Any experience/idea about this from other implementations?

Review comment:
       Thanks for the info. I'll remove this TODO.





[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542885102



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release defines a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level, then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and do not want to stop our clients from using features that are not
+on this list, but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and in one of its
+implementations) but not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases, which follow the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning, if one implementation supports the core
+features of the parquet-format release `a.b.x`, it must be able to read any
+parquet file written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document,
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that, we require at least two
+different implementations that are released and widely tested. We also require
+interoperability tests for that feature to prove that one implementation can
+read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported:
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical types](LogicalTypes.md) are practically annotations that help to
+interpret the related primitive type (or structure). Originally we had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it was hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow the use of
+the old `ConvertedType` values according to the specified rules, but we expect
+that the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       This is a good point. I think this is true for logical types in general - if the application doesn't support the type at all, then it's a limitation of the application, not the application's Parquet implementation. It's reasonable for applications to not want to handle, say, timestamps.
   
   Of course parquet-mr and parquet-cpp need to be able to at least pass through the data to the application in all of these cases.
   
   So maybe all of these need to be qualified:
   - If the implementation is a general-purpose Parquet library, users of the library must be able to access data with all logical types. It may just return the data in the same format as the primitive type. E.g. if JSON is just returned as a STRING, that seems fine.
   - If the implementation is anything else (SQL engine, computation framework, etc), then if it supports a primitive or logical type, it must support all core features associated with that type.
   
   E.g. suppose you have an application that only processes DECIMAL values. It might be a bad and useless application, but that's not Parquet's problem. If it can read and write all DECIMAL encodings correctly, then it's compliant. If it couldn't read decimals stored as int64, or if it wrote decimals with precision < 10, it's not compliant (see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).  
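
The DECIMAL example above can be sketched as two small checks, one for the read side and one for the write side. This is only an illustration: the precision limits are the ones given in the DECIMAL section of LogicalTypes.md, the function names (`max_digits_fixed`, `reader_must_accept`, `writer_may_emit`) are hypothetical and do not belong to any Parquet library, and treating INT64 with precision < 10 as something a compliant writer avoids is one reading of the comment (the spec itself only says that combination produces a warning).

```python
def max_digits_fixed(type_length: int) -> int:
    """Largest decimal precision that fits in a signed value of
    `type_length` bytes: floor(log10(2^(8*n - 1) - 1))."""
    largest = 2 ** (8 * type_length - 1) - 1
    # For a positive integer x, floor(log10(x)) == number of digits - 1.
    return len(str(largest)) - 1


def reader_must_accept(physical: str, precision: int, type_length: int = 0) -> bool:
    """True if the layout is a valid DECIMAL, so a compliant reader that
    supports DECIMAL at all is expected to decode it."""
    if precision < 1:
        return False
    if physical == "INT32":
        return precision <= 9
    if physical == "INT64":
        return precision <= 18
    if physical == "FIXED_LEN_BYTE_ARRAY":
        return type_length >= 1 and precision <= max_digits_fixed(type_length)
    # BYTE_ARRAY places no upper bound on precision.
    return physical == "BYTE_ARRAY"


def writer_may_emit(physical: str, precision: int, type_length: int = 0) -> bool:
    """Stricter write-side check (assumption: small precisions should use
    INT32 rather than INT64, following the reading described above)."""
    if physical == "INT64" and precision < 10:
        return False
    return reader_must_accept(physical, precision, type_length)


if __name__ == "__main__":
    print(reader_must_accept("INT64", 5))   # True
    print(writer_may_emit("INT64", 5))      # False
```

Keeping the two checks separate mirrors the asymmetry in the comment: a reader has to tolerate every valid layout, while a writer is held to the narrower set.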





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r559639264



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       For the record, we (meaning @chairmank) discovered an incompatibility in the LZ4 compatibility code that was added to Parquet C++ to handle Hadoop-produced files: see https://issues.apache.org/jira/browse/ARROW-11301
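
   As a rough illustration of the kind of framing involved (an assumption based on how the Hadoop LZ4 codec is commonly described, not something stated in this thread): each frame is taken to start with two big-endian 32-bit integers, the decompressed size and the compressed size, followed by the raw LZ4 block. The helper name `split_hadoop_lz4_frames` is hypothetical, and the sketch only splits frames; it does not decompress them.

```python
import struct


def split_hadoop_lz4_frames(buf: bytes):
    """Split a buffer into (decompressed_size, compressed_block) pairs.

    Assumed layout per frame (hypothetical sketch, not a library API):
      bytes 0..3  big-endian uint32, decompressed size of the frame
      bytes 4..7  big-endian uint32, compressed size of the frame
      bytes 8..   compressed payload of that many bytes
    """
    frames = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 8:
            raise ValueError("truncated frame header")
        decompressed_size, compressed_size = struct.unpack_from(">II", buf, offset)
        offset += 8
        if compressed_size > len(buf) - offset:
            raise ValueError("frame payload exceeds buffer; probably not Hadoop-framed")
        frames.append((decompressed_size, buf[offset:offset + compressed_size]))
        offset += compressed_size
    return frames


# Two frames with placeholder payloads (not valid LZ4 data).
example = struct.pack(">II", 10, 3) + b"abc" + struct.pack(">II", 20, 2) + b"de"
frames = split_hadoop_lz4_frames(example)
```

   A reader that assumes exactly one frame per buffer would stop after the first pair, which can be one way an incompatibility of this sort shows up.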





[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771331050


   @raduteo the main driver for this PR is that there has been a lot of confusion about what is defined as needing core support. Once we finish this PR, I'm not fully opposed to the idea of supporting this field, but I think we need to go into greater detail in the specification about what supporting the individual files actually means. (I also think that being willing to help both Java and C++ support it can go a long way toward convincing people that it should become a core feature.)



[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r699241790



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values

Review comment:
       Parquet C++ should now support DELTA_BINARY_PACKED for reading.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542896910



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       What does it mean that it uses Hadoop for the codec? Does that mean it's difficult or impossible in some environments to enable the compression codecs?
   
   We ran into the same LZ4 issue too - https://issues.apache.org/jira/browse/IMPALA-8617





[GitHub] [parquet-format] nevi-me commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-756627155


   > For the record, we started [documenting](https://github.com/apache/arrow/blob/master/docs/source/cpp/parquet.rst#supported-parquet-features) the features supported by parquet-cpp (which is part of the Arrow codebase).
   
   I've opened https://issues.apache.org/jira/browse/ARROW-11181 so we can do the same for the Rust implementation



[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r576804103



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**

Review comment:
       > I think parquet-cpp and parquet-mr both support unsigned right? 
   
   parquet-cpp definitely does, since C++ has native unsigned integers (i.e. no cast to signed is involved).
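
   To illustrate why the cast matters for readers without native unsigned types (a minimal sketch; the helper name `as_unsigned_32` is hypothetical and not part of any Parquet library): an unsigned 32-bit logical value occupies the 32 bits of an `INT32`, so a reader that decodes it as signed has to reinterpret the bits before comparing or displaying it.

```python
import struct


def as_unsigned_32(signed_value: int) -> int:
    """Reinterpret a signed 32-bit integer as its unsigned bit pattern."""
    return signed_value & 0xFFFFFFFF


# Write 4_000_000_000 as an unsigned 32-bit value, then decode the same
# four bytes as a signed INT32, as a reader without unsigned types would.
raw = struct.pack("<I", 4_000_000_000)
decoded_signed = struct.unpack("<i", raw)[0]

# The signed view is negative, so a naive min/max would order it below 1.
naive_order = decoded_signed < 1
# After reinterpretation the value compares as intended.
correct_order = as_unsigned_32(decoded_signed) > 1
```

   The same concern applies to statistics: min/max computed on the signed view would be wrong for values above the signed range.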





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593038087



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       It is not. Logical types can significantly modify the meaning of a primitive type. Going back to one of my examples: a BINARY primitive type can be a STRING or a DECIMAL, and even the statistics (min/max) differ between the two.
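
To illustrate the statistics point, here is a minimal sketch that is not part of the PR. It assumes the sort orders described in LogicalTypes.md (unsigned byte-wise comparison for STRING, signed comparison of the big-endian two's complement unscaled value for DECIMAL) and that both decimals share the same scale. The class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class BinaryOrdering {
    // STRING: compare the raw bytes as unsigned values, lexicographically.
    static int compareAsString(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b); // available since Java 9
    }

    // DECIMAL: the bytes hold a big-endian two's complement unscaled value.
    // Assumption: both values have the same scale, so comparing the
    // unscaled integers is enough.
    static int compareAsDecimal(byte[] a, byte[] b) {
        return new BigInteger(a).compareTo(new BigInteger(b));
    }

    public static void main(String[] args) {
        byte[] x = {(byte) 0xFF}; // unscaled DECIMAL value -1
        byte[] y = {(byte) 0x01}; // unscaled DECIMAL value 1
        System.out.println(compareAsString(x, y) > 0);  // true: 0xFF sorts after 0x01
        System.out.println(compareAsDecimal(x, y) < 0); // true: -1 sorts before 1
    }
}
```

The same two byte arrays swap places depending on the annotation, so a reader that ignores the logical type could pick the wrong min/max when using statistics for filtering.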





[GitHub] [parquet-format] ggershinsky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

ggershinsky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771442874







[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542330047



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       Doesn't the data page v2 header allow selective compression of individual pages?





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593034100



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       So it is not ok for a reader to ignore the logical type and still interpret the data based on the physical type?





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593002660



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**

Review comment:
       I am aware of only one implementation where the cast can be a problem: parquet-mr. There is no native support for unsigned primitive values in Java. In parquet-mr we give back the original bit pattern in the related primitive (so a long for an unsigned int64) and it is up to the client to deal with the value. If the client does not check the logical type (and so does not realize that the value is unsigned), the value might be represented incorrectly.
   I think this would be the case in Hive. Meanwhile, I am not aware of any SQL engines that write unsigned int values, so currently it does not seem to be a problem.
   If we want to be on the safe side we should not add unsigned values because of the Java limitations, but I am not sure about this because the issue would be on the client side of parquet-mr and not in parquet-mr itself. What do you think?
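
A minimal sketch of the client-side problem, not taken from parquet-mr: the class, the method names and the `isSigned` flag are hypothetical stand-ins for whatever the client derives from the `INTEGER` logical type. It only uses the standard `java.lang.Long` helpers for unsigned interpretation.

```java
public class UnsignedInt64 {
    // Render the raw 64-bit pattern according to the signedness
    // declared by the logical type (hypothetical isSigned flag).
    static String render(long raw, boolean isSigned) {
        return isSigned ? Long.toString(raw) : Long.toUnsignedString(raw);
    }

    // Compare two raw values according to the declared signedness.
    static int compare(long a, long b, boolean isSigned) {
        return isSigned ? Long.compare(a, b) : Long.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        long raw = -1L; // bit pattern 0xFFFFFFFFFFFFFFFF
        System.out.println(render(raw, true));  // -1
        System.out.println(render(raw, false)); // 18446744073709551615
        // Ignoring the logical type also flips the ordering against 1:
        System.out.println(compare(raw, 1L, true) < 0);  // true
        System.out.println(compare(raw, 1L, false) > 0); // true
    }
}
```

A client that skips the logical type check would take the signed branch for every value, which is the misrepresentation described above.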





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542301349



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two

Review comment:
       I think it is more an interoperability test than an integration one. Anyway it is a good idea to add this requirement here.





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r544375834



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       My JSON example tried to point out that you have to recognize the logical type JSON to know the value is a UTF-8 encoded string and not an arbitrary BINARY value. So in this case you at least partly support the logical type JSON.
   
   As per the specification, a value with the logical type JSON has to be a UTF-8 encoded JSON document (JSON spec referenced). What the Parquet implementation does with it is not part of the Parquet spec. So, from the write point of view, you shall only use this annotation if the value really is JSON and is in UTF-8. From the read point of view it is up to the implementation, I guess.
   
   I agree it is not always clear what supporting a logical type requires on the read side. At least the reader shall respect the semantics of the related type. For example, in the case of TIMESTAMP, if isAdjustedToUTC is true then the timestamp value shall be represented according to the *instant* semantics, otherwise according to the *local* semantics. (See details [here](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp).) Similarly, even though UUID is stored in a binary format, it is highly suggested to display it in the UUID string format (if the UUID logical type is supported).
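
As a rough sketch of what "respecting the semantics" could look like on the read side (not part of the PR or of any Parquet implementation): the class and method names are hypothetical, the timestamp is assumed to be an `INT64` in microseconds, and the UUID is assumed to be a 16-byte big-endian value as described in LogicalTypes.md.

```java
import java.nio.ByteBuffer;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

public class LogicalTypeSemantics {
    // isAdjustedToUTC = true: the value is a point on the UTC timeline.
    static Instant asInstant(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    // isAdjustedToUTC = false: the value is a wall-clock reading with no
    // time zone attached; the offset below only anchors the epoch.
    static LocalDateTime asLocal(long micros) {
        return LocalDateTime.ofEpochSecond(0, 0, ZoneOffset.UTC)
                .plus(micros, ChronoUnit.MICROS);
    }

    // UUID: 16 bytes, most significant 8 bytes first
    // (ByteBuffer reads big-endian by default).
    static UUID asUuid(byte[] sixteenBytes) {
        ByteBuffer buf = ByteBuffer.wrap(sixteenBytes);
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        System.out.println(asInstant(1_000_000L)); // 1970-01-01T00:00:01Z
        System.out.println(asLocal(1_000_000L));   // 1970-01-01T00:00:01
        byte[] raw = new byte[16];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
        System.out.println(asUuid(raw)); // 00010203-0405-0607-0809-0a0b0c0d0e0f
    }
}
```

The stored `INT64` is identical in both timestamp cases; only the annotation tells the reader whether to hand back an `Instant` or a `LocalDateTime`.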





[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771465002


   @ggershinsky, I think it is totally fine to say that encryption does not support external chunks and similar features, even if they were fully supported by the implementations.
   
   BTW, as we are already talking about encryption: I did not plan to include this feature here for now. I think it is not mature enough yet, and it is not something that every implementation requires. It might be a good candidate for a later release of the core features.



[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827







[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593038805



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       That makes sense, thank you.
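
For readers following the thread, the versioning rule in the quoted `CoreFeatures.md` draft (a reader at release `a.b.x` must be able to read files written at `a.d.y` where `b >= d`) can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical and not part of any Parquet implementation, and the draft says nothing about readers and writers on different major versions, so the sketch simply reports "no guarantee" in that case.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def read_is_guaranteed(reader_version: str, writer_version: str) -> bool:
    """Return True when the draft's rule guarantees readability.

    The guarantee holds only within one major version and only when the
    reader's minor version is at least the writer's. The patch component
    is ignored. False means "not guaranteed", not "will fail".
    """
    r_major, r_minor, _ = parse_version(reader_version)
    w_major, w_minor, _ = parse_version(writer_version)
    return r_major == w_major and r_minor >= w_minor
```

Under this reading, a reader claiming the core features of `2.9.0` is expected to read a file written against `2.7.3`, while the reverse is not promised.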





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r699241790



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated, so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values

Review comment:
       Parquet C++ should now support DELTA_BINARY_PACKED for reading.
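
To make the compliance idea behind this comment concrete: the draft's Purpose section says a compliant writer uses only core features, while a compliant reader implements all of them. A minimal sketch of that check for encodings follows. The set below copies the encodings visible in the quoted hunk, several of which are still marked **(?)** in the draft, so it is an assumption rather than a settled list. The function names are hypothetical.

```python
from typing import FrozenSet, Set

# Encodings listed in the quoted draft hunk. Several entries are still
# marked "(?)" there, so treat this set as illustrative only.
CORE_ENCODINGS: FrozenSet[str] = frozenset({
    "PLAIN",
    "PLAIN_DICTIONARY",
    "RLE",
    "DELTA_BINARY_PACKED",
    "DELTA_LENGTH_BYTE_ARRAY",
    "DELTA_BYTE_ARRAY",
})


def writer_is_compliant(used: Set[str]) -> bool:
    """A writer is compliant if it uses only core encodings."""
    return used <= CORE_ENCODINGS


def missing_for_reader(implemented: Set[str]) -> Set[str]:
    """Core encodings a reader still lacks; an empty set means compliant."""
    return set(CORE_ENCODINGS - implemented)
```

The asymmetry is the point: a writer may use a subset of the list, but a reader missing even one listed encoding (for example `DELTA_BINARY_PACKED`) cannot claim the compliance level.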





[GitHub] [parquet-format] emkornfield commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771331050


   @raduteo the main driver for this PR is that there has been a lot of confusion about what is defined as needing core support. Once we finish this PR, I'm not fully opposed to the idea of supporting this field, but I think we need to go into greater detail in the specification about what supporting the individual files actually means (and I think being willing to help both Java and C++ support it can go a long way toward convincing people that it should become a core feature).



[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593025706



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       This is a bit difficult for me to read. What do you call "related type"?





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593025706



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       This is a bit difficult for me to read. What do you call "related type" or "related version of core features"?





[GitHub] [parquet-format] jorisvandenbossche commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569260883



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same, it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated, so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries

Review comment:
       I *think* in Arrow parquet-cpp you can use RLE_DICTIONARY (when asking for version=2) without also using V2 data pages





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r544334816



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       I think this topic is not that simple. Although parquet-mr usually returns values of the primitive types only (no conversion for the logical type), there is some additional logic that depends on the logical type.
   * For example, the sort order matters for statistics (min/max for row groups or column indexes). The sort order might depend on the logical type (e.g. BINARY as a STRING or as a DECIMAL).
   * In your JSON example, you cannot return a String unless you know that the related BINARY primitive contains a UTF-8 encoded value.
   * We have many problems with the different timestamp semantics used in the different projects. This problem cannot be solved entirely in Parquet, but Parquet can be the basis of the solution. That's why we introduced the new timestamp logical type.
   * Some parquet-mr bindings (e.g. parquet-avro) do depend on the logical types when converting values.
   * We have already found some issues with the string representation. Impala did not write the logical type UTF-8 for strings, only the primitive type BINARY. Meanwhile, Iceberg required this logical type; otherwise it treated the value as binary and could not work with it.
   
   To summarize, we might not add all of the logical types (or not all of their primitive representations) to the *core features*, but I think we shall add at least the ones that are already widely used. The whole *core features* idea is about helping interoperability, and logical types are a big part of it.
   
   That said, from an interoperability point of view, a read-only Parquet implementation that has no concept of DECIMAL might not implement reading it. But if a writer wants to write a UTF-8 string, then we shall require it to use the STRING logical type in the schema.
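
To illustrate the point about sort order, here is a minimal Python sketch. It assumes that STRING values compare as unsigned bytes, lexicographically, and that a DECIMAL stored in a BYTE_ARRAY holds the unscaled value as a big-endian two's complement integer. The helper names `string_order_key` and `decimal_order_key` are made up for this example and do not come from parquet-mr or any other implementation.

```python
def string_order_key(raw: bytes) -> bytes:
    # STRING: Python compares bytes objects as unsigned values, byte by byte.
    return raw


def decimal_order_key(raw: bytes) -> int:
    # DECIMAL on a binary primitive: decode the unscaled value as a
    # big-endian two's complement integer and compare numerically.
    return int.from_bytes(raw, byteorder="big", signed=True)


# The same three binary values, ordered under each interpretation.
values = [b"\x01", b"\xff", b"\x00\x80"]

as_string = sorted(values, key=string_order_key)
as_decimal = sorted(values, key=decimal_order_key)

# The min/max statistics a writer would record differ between the two.
string_min, string_max = as_string[0], as_string[-1]
decimal_min, decimal_max = as_decimal[0], as_decimal[-1]
```

Under the STRING interpretation the minimum is `b"\x00\x80"` and the maximum is `b"\xff"`; under the DECIMAL interpretation the same bytes decode to 128 and -1, so minimum and maximum swap. A reader that ignores the logical type could therefore prune row groups incorrectly.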





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567767040



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `BROTLI` **(?)**

Review comment:
       I am not sure if we want to list every possibility based on the format and state whether it is supported or not. LZO was removed after [these comments](https://github.com/apache/parquet-format/pull/164#discussion_r541122019).





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593021944



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       We shall agree on the list of supported logical types and, where there is more than one option, on their primitive representations.
   
   From the "support" point of view I would add something like the following.
   > The writer shall support every listed logical type by either respecting its syntax/semantics or not using the related type at all. (For example, if a writer does not use, and therefore does not write, any timestamp-typed data, it may still support the related version of core features.) The reader shall support every listed logical type whose semantics match one of its own data types. For the other listed logical types, the reader can either convert the values to one of its own data types while respecting the syntax/semantics (e.g. representing a timestamp as a string) or not read these types. In the latter case the reader shall still recognize the related logical type, and it shall be clear to the user that the file contains data that could not be represented. (The key here is that if the reader has an internal data representation for the semantics specified by one of these logical types, it shall support reading it. Otherwise it shall either represent it in a meaningful way or not read it and notify the user.)
   
   What do you think?
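
The reader-side rule proposed above could be sketched roughly as follows. This is only an illustration of the three outcomes (read natively, convert, or skip and notify). `ReadAction`, `plan_read`, and the example type sets are hypothetical names chosen for this sketch and are not part of any Parquet implementation.

```python
from enum import Enum


class ReadAction(Enum):
    NATIVE = "read into a matching native type"
    CONVERT = "convert to another type, keeping the semantics"
    SKIP_AND_NOTIFY = "do not read, but tell the user"


def plan_read(logical_type: str, native: set, convertible: set) -> ReadAction:
    # The reader has its own data type with matching semantics: it must read it.
    if logical_type in native:
        return ReadAction.NATIVE
    # No matching type, but a meaningful representation exists
    # (e.g. a timestamp rendered as a string).
    if logical_type in convertible:
        return ReadAction.CONVERT
    # The type is still recognized, but the data cannot be represented.
    return ReadAction.SKIP_AND_NOTIFY


# Hypothetical reader: native strings and dates, timestamps shown as strings.
native_types = {"STRING", "DATE"}
convertible_types = {"TIMESTAMP"}

plan = {
    t: plan_read(t, native_types, convertible_types)
    for t in ("STRING", "TIMESTAMP", "UUID")
}
```

In this sketch the hypothetical reader reads `STRING` natively, converts `TIMESTAMP`, and reports `UUID` to the user as data it could not represent.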





[GitHub] [parquet-format] ggershinsky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

ggershinsky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771442874


   To add the Parquet encryption angle to this discussion: this feature protects the confidentiality and integrity of Parquet files (when they have columns with sensitive data). These security layers will make it difficult to support many of the legacy features mentioned above, like external chunks or merging multiple files into a single master file (which interferes with the definition of file integrity). Reading encrypted data before file writing is finished is also difficult. None of these is impossible, but they are challenging and would require explicit scaffolding plus some Thrift format changes. If there is strong demand for using encryption with these legacy features despite their being deprecated (or with some of the new features mentioned), we can plan this for future versions of parquet-format, parquet-mr, etc.



[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r575576769



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**

Review comment:
       I think parquet-cpp and parquet-mr both support unsigned, right? I checked Impala and it would just interpret it as a signed integer. I think it's not the only engine consuming Parquet that only has signed types.
   
   I guess including unsigned here seems fairly reasonable, but it would come with the caveat that how the cast to signed is handled (e.g. allowing overflow or not) is implementation-dependent.
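
A small Python sketch of the two behaviours mentioned here, for a 32-bit unsigned value read by an engine that only has signed types. Both function names are hypothetical and only illustrate the difference; they are not taken from Impala, parquet-cpp, or parquet-mr.

```python
import struct


def reinterpret_uint32_as_int32(value: int) -> int:
    # "Allow overflow": keep the same 4 bytes and read them back as signed.
    return struct.unpack("<i", struct.pack("<I", value))[0]


def checked_uint32_to_int32(value: int) -> int:
    # "Disallow overflow": refuse values a signed 32-bit type cannot hold.
    if value > 2**31 - 1:
        raise OverflowError(f"{value} does not fit in a signed 32-bit integer")
    return value


small = reinterpret_uint32_as_int32(42)
wrapped = reinterpret_uint32_as_int32(4294967295)
boundary = reinterpret_uint32_as_int32(2147483648)
```

Values up to 2^31 - 1 come through unchanged either way. Above that, the reinterpreting reader silently returns negative numbers (4294967295 becomes -1), while the checked reader raises an error, which is the implementation-dependent behaviour the caveat would need to cover.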

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries

Review comment:
       Impala 3.4 and below wouldn't support this, FWIW.
   
   I also recently implemented this in Impala and on the read side it's orthogonal to the data page version, but does require additional code: IMPALA-6434





[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-744385260


   @julienledem, @rdblue, could you please add your notes/ideas about this?



[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542865014



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       It looks like parquet-cpp added support - https://issues.apache.org/jira/browse/PARQUET-458 so maybe we're the odd ones out now. https://issues.apache.org/jira/browse/IMPALA-6433 is the Impala JIRA
   
   @wesm  said "BTW V2 files are not considered suitable for production by the Parquet community" - https://issues.apache.org/jira/browse/PARQUET-458?focusedCommentId=16931914&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931914
   
   This is probably relevant - not sure what issues he's referring to though.





[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541120222



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?

Review comment:
       Impala does this for decimal values. I think this is fine and makes sense to do. E.g. in the TPC-H and probably TPC-DS data sets there are some low-cardinality decimal columns:
   
   [localhost.EXAMPLE.COM:21050] default> show column stats tpch_parquet.lineitem;
   Query: show column stats tpch_parquet.lineitem
   +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
   | Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size      | #Trues | #Falses |
   +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
   | l_orderkey      | BIGINT        | 1563438          | 0      | 8        | 8.0           | -1     | -1      |
   | l_partkey       | BIGINT        | 200516           | 0      | 8        | 8.0           | -1     | -1      |
   | l_suppkey       | BIGINT        | 9712             | 0      | 8        | 8.0           | -1     | -1      |
   | l_linenumber    | INT           | 7                | 0      | 4        | 4.0           | -1     | -1      |
   | l_quantity      | DECIMAL(12,2) | 51               | 0      | 8        | 8.0           | -1     | -1      |
   | l_extendedprice | DECIMAL(12,2) | 868550           | 0      | 8        | 8.0           | -1     | -1      |
   | l_discount      | DECIMAL(12,2) | 11               | 0      | 8        | 8.0           | -1     | -1      |
   | l_tax           | DECIMAL(12,2) | 9                | 0      | 8        | 8.0           | -1     | -1      |
   | l_returnflag    | STRING        | 3                | 0      | 1        | 1.0           | -1     | -1      |
   | l_linestatus    | STRING        | 2                | 0      | 1        | 1.0           | -1     | -1      |
   | l_shipdate      | STRING        | 2629             | 0      | 10       | 10.0          | -1     | -1      |
   | l_commitdate    | STRING        | 2559             | 0      | 10       | 10.0          | -1     | -1      |
   | l_receiptdate   | STRING        | 2658             | 0      | 10       | 10.0          | -1     | -1      |
   | l_shipinstruct  | STRING        | 4                | 0      | 17       | 11.9986381531 | -1     | -1      |
   | l_shipmode      | STRING        | 7                | 0      | 7        | 4.28530454636 | -1     | -1      |
   | l_comment       | STRING        | 4652621          | 0      | 43       | 26.4941692352 | -1     | -1      |
   +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
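
To make the point about dictionary encoding concrete, here is a rough sketch of the size effect on a low-cardinality fixed-width column. It is only an illustration, not how Impala or parquet-mr implement it: the function names and the sample column are made up, and the estimate ignores page headers and the RLE/bit-packing hybrid that a real writer applies to the indices.

```python
def dictionary_encode(values):
    """Map each value to the index of its first occurrence in a dictionary."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def estimated_sizes(values, value_width):
    """Return (plain_bytes, dictionary_bytes) for fixed-width values.

    Rough estimate only: ignores page headers and the RLE/bit-packing
    hybrid that a real writer would apply to the indices.
    """
    dictionary, indices = dictionary_encode(values)
    plain_bytes = len(values) * value_width
    # Bits needed to address every dictionary entry (at least 1).
    bit_width = max(1, (len(dictionary) - 1).bit_length())
    index_bytes = (len(indices) * bit_width + 7) // 8
    return plain_bytes, len(dictionary) * value_width + index_bytes


# Hypothetical column: 9 distinct 8-byte values, similar in shape to l_tax.
column = [i.to_bytes(8, "big") for i in range(9)] * 1000
plain, encoded = estimated_sizes(column, value_width=8)
print(plain, encoded)
```

For this made-up column the sketch prints `72000 4572`, which is why dictionary encoding looks attractive for columns with only a handful of distinct values.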
   





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r543278543



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       That sounds fine to me.





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r544306592



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       Parquet-mr does not ship all of the codec implementations it supports. E.g. we do not unit test LZO, LZ4 and ZSTD because they are "not distributed in the default version of Hadoop". It means it is up to the user to install the related codecs into the Hadoop environment so parquet-mr can use them. (It might mean that in some environments it is not possible to use some codecs, but I don't have much experience with this.) This is the current situation.
   That's why I've said that if we add these codecs to the *core features* then parquet-mr has to add the implementations of these codecs (as direct dependencies) so it can guarantee the codecs are accessible in any situation.
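
The availability problem described above can be sketched in a few lines. This is a Python analogy, not parquet-mr code: the codec-to-module mapping is an assumption (module names of commonly used third-party packages), and `codec_availability` and `missing_codecs` are hypothetical helpers. The point is only that an implementation claiming a compliance level would need every listed codec to resolve, rather than leaving it to the environment.

```python
import importlib.util

# Assumed mapping from Parquet codec name to the Python module that would
# provide it. Only gzip is in the standard library; the others are optional
# third-party packages and may or may not be installed.
CODEC_MODULES = {
    "GZIP": "gzip",
    "SNAPPY": "snappy",
    "LZ4": "lz4",
    "ZSTD": "zstandard",
    "BROTLI": "brotli",
}


def codec_availability(codec_modules=CODEC_MODULES):
    """Report which codecs can be imported in the current environment."""
    return {
        codec: importlib.util.find_spec(module) is not None
        for codec, module in codec_modules.items()
    }


def missing_codecs(required, availability):
    """Return the required codecs that the environment cannot provide."""
    return sorted(c for c in required if not availability.get(c, False))


availability = codec_availability()
# Any name printed here would have to become a direct dependency before
# the implementation could claim to support it unconditionally.
print(missing_codecs(CODEC_MODULES, availability))
```

The output depends on what is installed locally, which is exactly the situation the comment describes.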





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542278289



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**

Review comment:
       I am fine with removing LZO because of the licensing. Also, I am not sure it would have significant benefits compared to the others (LZ4 or ZSTD).
   UPDATE: It seems there is an Apache licensed LZO implementation: https://github.com/airlift/aircompressor





[GitHub] [parquet-format] timarmstrong commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771353955


   +1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is widely supported in practice. There are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption of new features because the status of each feature will be clearer.



[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541126494



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.

Review comment:
       ```suggestion
   implementations of it) and not on the core feature list are experimental.
   ```





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r544356079



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       > In your example about JSON you cannot return a String if you don't know that the related BINARY primitive contains an UTF-8 encoded value.
   
   In practice that shouldn't be very problematic, as most JSON should be UTF8-encoded (note that the [IETF standard](https://tools.ietf.org/html/rfc7159) requires either UTF-8, UTF-16 or UTF-32, and states that UTF-8 is the default encoding).
   
   > The whole core features idea is about to help interoperability. Logical types are a big part of it.
   
   Agreed. The question is: what does _supporting_ a logical type mean? For example, if I accept JSON data and simply present it as regular string data, is that enough to say that I support it? Otherwise, am I supposed to do something specific?
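
One way to make the question concrete is the following minimal Python sketch of two possible levels of "support" for a JSON-annotated binary value. The function names are hypothetical and neither level is mandated by the specification; the sketch only assumes UTF-8 encoded input, as discussed above.

```python
import json
from typing import Any


def present_as_string(raw: bytes) -> str:
    """Minimal support: expose a JSON-annotated value as plain text.

    Assumes UTF-8, which RFC 7159 names as the default JSON encoding.
    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return raw.decode("utf-8")


def present_as_document(raw: bytes) -> Any:
    """Fuller support: additionally parse the text into a JSON value."""
    return json.loads(present_as_string(raw))
```

With the first function a reader only needs to know that the bytes are UTF-8 text; the second also validates that the text is well-formed JSON.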
   





[GitHub] [parquet-format] pitrou commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-746410109


   For the record, we started [documenting](https://github.com/apache/arrow/blob/master/docs/source/cpp/parquet.rst#supported-parquet-features) the features supported by parquet-cpp (which is part of the Arrow codebase).



[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569260891



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer. 
    */ 
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.

Review comment:
       It doesn't sound good. It means that we are using a required field for different purposes. Because it is required we cannot deprecate it easily.
   parquet-mr does not read this field; it only writes `1` all the time. What does parquet-cpp do with this value on the read path?
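
For context, the compatibility rule that the proposed field is meant to serve (from the Versioning section of CoreFeatures.md: a reader at release `a.b.x` must read files written at `a.d.y` where `b >= d`) can be sketched as follows. This is a minimal illustration in Python with hypothetical function names; it assumes the stored value is a plain `major.minor.patch` string.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' release string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reader_can_read(reader_level: str, file_level: str) -> bool:
    """Apply the rule: same major version, reader minor >= file minor.

    The patch component is ignored. A different major version returns
    False because the rule gives no guarantee across major versions.
    """
    r_major, r_minor, _ = parse_release(reader_level)
    f_major, f_minor, _ = parse_release(file_level)
    return r_major == f_major and r_minor >= f_minor
```

A check of this kind needs a field whose meaning is reserved for the compliance level, which is why reusing a required field that writers already set to a constant is problematic.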





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541125479



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it

Review comment:
       ```suggestion
   for implementations. If an implementation claims that it
   ```





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541127737



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two

Review comment:
       In Arrow we've adopted the requirement of an integration test demonstrating compatibility. This might be a good idea for new features in Parquet as well?
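
A rough sketch of what such an integration test could check is given below, in Python. Every implementation writes a sample, and every implementation (including the writer itself) must read it back unchanged. The two "implementations" here are hypothetical stand-ins that serialize to JSON bytes rather than Parquet, purely so that the example is self-contained; this is not how Arrow's integration tests are implemented.

```python
import json
from itertools import product
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, int]]
Writer = Callable[[Rows], bytes]
Reader = Callable[[bytes], Rows]


def interop_failures(
    impls: Dict[str, Tuple[Writer, Reader]], rows: Rows
) -> List[Tuple[str, str]]:
    """Return every (writer, reader) pair whose round trip alters the rows."""
    failures = []
    for (w_name, (write, _)), (r_name, (_, read)) in product(impls.items(), repeat=2):
        if read(write(rows)) != rows:
            failures.append((w_name, r_name))
    return failures


# Hypothetical stand-ins: both "implementations" use JSON bytes, not Parquet.
def write_compact(rows: Rows) -> bytes:
    return json.dumps(rows, separators=(",", ":")).encode("utf-8")


def write_pretty(rows: Rows) -> bytes:
    return json.dumps(rows, indent=2).encode("utf-8")


def read_json(data: bytes) -> Rows:
    return json.loads(data.decode("utf-8"))


IMPLS = {
    "impl_a": (write_compact, read_json),
    "impl_b": (write_pretty, read_json),
}
```

An empty result from `interop_failures(IMPLS, sample_rows)` would be the condition for considering a feature interoperable across the listed implementations.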





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541123184



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that

Review comment:
       ```suggestion
   The list of core features for a certain release makes a compliance level that
   ```





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541126025



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and

Review comment:
       nit: What happens if an implementation happens to be read only?  It seems like distinguishing compliance for reading and writing could make sense.
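
One possible way to separate the two kinds of compliance is sketched below in Python: a writer is compliant if it uses only core features, while a reader is compliant if it implements all of them. The feature names are an illustrative subset taken from the draft and the function names are hypothetical; the actual core feature list is still under discussion.

```python
from typing import AbstractSet

# Illustrative subset only; not the agreed core feature list.
CORE_FEATURES = frozenset({"PLAIN", "RLE", "SNAPPY", "GZIP"})


def writer_complies(
    features_used: AbstractSet[str], core: AbstractSet[str] = CORE_FEATURES
) -> bool:
    """A compliant writer emits only features from the core list."""
    return features_used <= core


def reader_complies(
    features_implemented: AbstractSet[str], core: AbstractSet[str] = CORE_FEATURES
) -> bool:
    """A compliant reader implements every feature on the core list."""
    return core <= features_implemented
```

Under this split a read-only implementation can still claim reader compliance, since nothing is required of its (non-existent) write path.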





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541131936



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**
+
+### Statistics
+
+However understanding statistics is not crucial to read the data in a file we

Review comment:
       ```suggestion
   Statistics are not required for reading data, but incorrect or underspecified statistics implementations can cause data loss.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541128983



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       It might make sense to also call out the status of data page V2 here.





[GitHub] [parquet-format] jorisvandenbossche commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569260134



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values

Review comment:
       Looking at the source code, the three DELTA encodings are not yet supported in Arrow parquet-cpp? (@emkornfield is that correct?)





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r567688108



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN

Review comment:
       The parts describing how the parquet-mr implementation works were not meant to be part of the final document. As I don't know much about the other implementations, I felt it would be good to note the current behavior so others can understand the situation better. Once we reach consensus on the related topics I will remove these notes.





[GitHub] [parquet-format] jorisvandenbossche commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569287102



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer. 
    */ 
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.

Review comment:
       AFAIK, the field is not used when reading (but I am not super familiar with the C++ code)
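
       The versioning rule in CoreFeatures.md (a reader at release `a.c.x` must be able to read files written at `a.b.y` where `c >= b`) suggests one way a reader could consult such a field. Below is a minimal Python sketch, assuming the proposed compliance value is a plain `major.minor.patch` string. The function names are hypothetical and not taken from any Parquet implementation.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers (assumed format)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def read_is_guaranteed(reader_level: str, file_level: str) -> bool:
    """Return True when the versioning rule guarantees the file is readable.

    The rule only covers releases that share a major version, so a different
    major version yields False, meaning "no guarantee" rather than "unreadable".
    The patch component is ignored, as in the rule itself.
    """
    reader_major, reader_minor, _ = parse_release(reader_level)
    file_major, file_minor, _ = parse_release(file_level)
    return reader_major == file_major and reader_minor >= file_minor
```

       A reader that finds the field unset would have no guarantee either way and would need to fall back to inspecting the features the file actually uses.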





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r591252577



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according to the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       I'll remove `LZ4` from here. I guess we cannot add `LZ4_RAW` here yet.





[GitHub] [parquet-format] jorisvandenbossche commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569249894



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer. 
    */ 
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.

Review comment:
       In Arrow (parquet-cpp) this `version` field actually gets populated with `1` or `2`, and depending on that different features are used to write the file (version 2 enabled some additional features, like writing nanosecond timestamps, or using RLE_DICTIONARY instead of PLAIN_DICTIONARY)





[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542293627



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       You mean as a note? I am currently not sure whether data page v2 is worth supporting at all. It was never widely used, and its benefits come more from the more advanced encodings than from the v2 page header itself.
   So, I am not sure about the status of v2.





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541128129



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"

Review comment:
       ```suggestion
   ## Core feature list
   ```





[GitHub] [parquet-format] ggershinsky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

ggershinsky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771478289


   @gszadovszky I certainly agree the encryption feature is not ready yet to be on this list.  According to the definition, we need to "have at least two different implementations that are released and widely tested". While we already have parquet-mr and parquet-cpp implementations, their release and testing status is not yet at that point. We can revisit this for a later version of the CoreFeatures.md.



[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-772102343


   > +1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption on new features because the status of each feature will be clearer.
   
   Thank you @emkornfield and @timarmstrong for the clarifications! 
   By the way, I am 100% in favor of the current initiative. I can relate to the world of pain one has to go through navigating Parquet incompatibilities, and I can definitely see how this can mitigate those issues while allowing the standard and the underlying implementations to evolve.



[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542333695



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is becasue `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `LZO` **(?)**
+* `BROTLI` **(?)**
+* `LZ4` **(?)**
+* `ZSTD` **(?)**

Review comment:
       parquet-cpp has recently fixed its LZ4 support to be compatible with the weird encoding used by parquet-mr.





[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542332574



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same if must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**

Review comment:
       The last three seem highly application-specific. For example, Arrow doesn't have native types for them.
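
       To illustrate why membership in the list matters here: per the Purpose section quoted above, a writer at a compliance level may use only listed features, while a reader at that level must implement all of them. The following is a sketch of those two rules as set checks. The function names and the example feature sets are hypothetical and do not reflect any agreed list.

```python
from typing import Iterable


def writer_is_compliant(used: Iterable[str], core: Iterable[str]) -> bool:
    """A writer complies if every feature it uses is on the core list."""
    return set(used) <= set(core)


def reader_is_compliant(supported: Iterable[str], core: Iterable[str]) -> bool:
    """A reader complies if it implements every feature on the core list."""
    return set(core) <= set(supported)


# Hypothetical example: a core list that includes UUID, and a reader
# without UUID support (names are illustrative only).
core_with_uuid = {"STRING", "MAP", "LIST", "ENUM", "DATE", "UUID"}
reader_without_uuid = {"STRING", "MAP", "LIST", "ENUM", "DATE"}

# The reader fails the check as soon as UUID is on the list ...
reader_ok = reader_is_compliant(reader_without_uuid, core_with_uuid)
# ... while a writer that only emits STRING and DATE still passes.
writer_ok = writer_is_compliant({"STRING", "DATE"}, core_with_uuid)
```

       Under these assumptions, adding an application-specific type to the list costs nothing for writers but obliges every compliant reader to handle it.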





[GitHub] [parquet-format] jorisvandenbossche commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r569228238



##########
File path: src/main/thrift/parquet.thrift
##########
@@ -1041,6 +1041,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer. 
    */ 
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * This field might be set with the version number of a parquet-format release
+   * if this file is created by using only the features listed in the related
+   * list of core features. See CoreFeatures.md for details.

Review comment:
       It might be useful to clarify how this differs from the `version` key above.
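
       For context, the Versioning section of the CoreFeatures.md under review ties the new field to parquet-format release numbers: a reader supporting release `a.c.x` must read files written for `a.b.y` where `c >= b`. A minimal sketch of that rule follows. The helper names are hypothetical, and treating a different major version as "not guaranteed" is an assumption, since the document only speaks about the same major version.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' release string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_read(reader_level: str, writer_level: str) -> bool:
    """Return True if the semantic-versioning rule guarantees readability.

    Same major version and reader minor >= writer minor; the patch
    component is ignored. A differing major version returns False here,
    which is an assumption rather than something the document states.
    """
    r_major, r_minor, _ = parse_release(reader_level)
    w_major, w_minor, _ = parse_release(writer_level)
    return r_major == w_major and r_minor >= w_minor
```

       For example, `can_read("2.9.0", "2.8.1")` holds under this rule, while `can_read("2.8.0", "2.9.0")` does not.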





[GitHub] [parquet-format] timarmstrong commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

timarmstrong commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r542844935



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certian release makes a compliance level that
+the different implementations can tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       Personally, I like the selective compression feature; it's a smart idea. The other DataPageV2 layout changes with respect to rep/def levels make sense.
   
   We haven't implemented it in Impala, but it would be worthwhile if we determine that it's going to be universally supported (chicken/egg problem - no point implementing if we can't turn it on).





[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770716726

@emkornfield, in parquet-mr there was another reason to use the `file_path` in the footer. The feature is called _summary files_. The idea was to have a separate file containing a summarized footer of several Parquet files so that you can do filtering and pruning without even checking a file's own footer. As far as I know, this implementation exists in parquet-mr only and there is no specification for it in parquet-format.
This feature is more or less abandoned, meaning that during the development of some newer features (e.g. column indexes, bloom filters) the related parts might not have been updated properly. There were a couple of discussions about this topic on the dev list: [here](https://lists.apache.org/thread.html/fb232d024d3ca0f3900b76fb884b55fad11dffafb182d6f336b37a69%40%3Cdev.parquet.apache.org%3E) and [here](https://lists.apache.org/thread.html/r2e539c50c1cc818304de2b7dc28a4109aaa529955a42664e3073f811%40%3Cdev.parquet.apache.org%3E).

Because neither _external column chunks_ nor _summary files_ spread across the different implementations (because of the lack of a specification), I think we should not include the usage of the field `file_path` in this document, or we should even explicitly specify that this field is not supported.

I am open to specifying such features properly, and after the required demonstration we may include them in a later version of the core features. However, I think these requirements (e.g. snapshot API, summary files) are not necessarily needed by all of our clients, or are already implemented in other ways (e.g. storing statistics in HMS, Iceberg).


[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r593026605



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries

Review comment:
       Another general choice here. Do we want to add encodings that might not be widely supported yet in the first round, or add only the ones we regularly use and leave the rest (maybe some even newer ones) for a later release?





[GitHub] [parquet-format] wesm commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

wesm commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r565684245



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that

Review comment:
       Parquet

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where

Review comment:
       Parquet

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.

Review comment:
       Parquet
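
The writer/reader asymmetry in the Purpose paragraph quoted above can be read as two set relations. A minimal sketch, assuming features are identified by plain strings; `CORE_ENCODINGS` is a deliberately tiny illustrative subset and both function names are hypothetical, not taken from any implementation:

```python
from typing import Set

# Illustrative subset only; the real list is whatever the released document says.
CORE_ENCODINGS: Set[str] = {"PLAIN", "RLE"}


def writer_is_compliant(used: Set[str], core: Set[str]) -> bool:
    """A compliant writer uses only features that are on the core list."""
    return used <= core


def reader_is_compliant(implemented: Set[str], core: Set[str]) -> bool:
    """A compliant reader implements every feature that is on the core list."""
    return core <= implemented
```

Under this reading a writer may use fewer features than the core list and a reader may implement more, but never the other way around.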

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.

Review comment:
       nit: switch b and d in this example
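
The rule this paragraph is trying to state could be sketched roughly as below. It is only an illustration under assumptions: versions are plain `major.minor.patch` strings, the patch number is ignored, and `parse_version` and `can_read` are hypothetical helper names, not part of parquet-format or any implementation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers (hypothetical helper)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_read(reader_release: str, writer_release: str) -> bool:
    """Return True if a reader supporting the core features of `reader_release`
    is required to read files written per the core features of `writer_release`.

    Assumption: same major version, and the reader's minor version is not
    older than the writer's; the patch number is ignored.
    """
    r_major, r_minor, _ = parse_version(reader_release)
    w_major, w_minor, _ = parse_version(writer_release)
    return r_major == w_major and r_minor >= w_minor
```

Spelling the check out this way makes it clear which of the two minor versions belongs to the reader and which to the writer.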

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)

Review comment:
       Parquet

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+### Compression
+
+The following compression algorithms are supported (including `UNCOMPRESSED`).
+* `SNAPPY`
+* `GZIP`
+* `BROTLI` **(?)**

Review comment:
       Is LZO deprecated? If so we should state that

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.

Review comment:
       Parquet

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN

Review comment:
       This is the first time these abbreviations are used

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document

Review comment:
       Parquet
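
The Purpose section in the hunk above sets two different obligations: a writer at a compliance level may use only features from the core feature list, while a reader at that level must implement all of them. A minimal sketch of the two checks follows, assuming features can be identified by name. The names `EXAMPLE_CORE_FEATURES`, `writer_is_compliant` and `reader_is_compliant` are hypothetical, and the example set is illustrative only, since the thread has not settled the actual list.

```python
from typing import FrozenSet, Iterable

# Hypothetical core list, for illustration only; the real list is still
# under discussion in this pull request.
EXAMPLE_CORE_FEATURES: FrozenSet[str] = frozenset(
    {"PLAIN", "RLE", "PLAIN_DICTIONARY"}
)


def writer_is_compliant(
    used: Iterable[str],
    core: FrozenSet[str] = EXAMPLE_CORE_FEATURES,
) -> bool:
    """A writer complies if every feature it uses is on the core list."""
    return set(used) <= core


def reader_is_compliant(
    implemented: Iterable[str],
    core: FrozenSet[str] = EXAMPLE_CORE_FEATURES,
) -> bool:
    """A reader complies if it implements every feature on the core list."""
    return core <= set(implemented)
```

The asymmetry is the point: a writer may use fewer features than the list offers, and a reader may implement more, without either one losing its compliance claim.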

##########
File path: README.md
##########
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+## Compatibility
+Because of the many features got into the Parquet format it is hard for the

Review comment:
       "got into the" -> "that have been added to the"?

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.

Review comment:
       Thrift (capitalize elsewhere)
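
The read-compatibility rule in the Versioning section quoted above (an implementation supporting release `a.b.x` must be able to read files written for release `a.d.y` where `b >= d`) can be sketched as a small check. This assumes the version is available as a `major.minor.patch` string; the quoted text does not specify how `compliance_level` would be encoded, and the function names here are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def reader_must_read(reader_version: str, writer_version: str) -> bool:
    """Return True when the quoted rule obliges the reader to read the file.

    Reader is at a.b.x and writer at a.d.y: the major versions must match
    and the reader's minor version must be at least the writer's (b >= d).
    The patch component plays no role in the rule.
    """
    r_major, r_minor, _ = parse_version(reader_version)
    w_major, w_minor, _ = parse_version(writer_version)
    return r_major == w_major and r_minor >= w_minor
```

A `False` result only means the rule gives no guarantee, for example across major versions; it does not mean the file is necessarily unreadable.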

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,178 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different parquet implementations.
+
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible

Review comment:
       Thrift

##########
File path: README.md
##########
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+## Compatibility
+Because of the many features got into the Parquet format it is hard for the
+different implementations to keep up. We introduced the list of "core
+features". This document is versioned by the parquet format releases and defines

Review comment:
       Parquet

##########
File path: README.md
##########
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+## Compatibility
+Because of the many features got into the Parquet format it is hard for the
+different implementations to keep up. We introduced the list of "core

Review comment:
       Is it hard? I don't think it's our place to say. We should stay objective and state that some implementations have not kept up.

##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)

Review comment:
       I meant that they aren't widely supported, so if you use them you will find implementations that can't read your files (per the chicken/egg problem above)





[GitHub] [parquet-format] emkornfield commented on a change in pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r541130559



##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,181 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level that
+the different implementations can be tied to. If an implementation claims that it
+provides the functionality of a parquet-format release core features it must
+implement all of the listed features according the specification (both read and
+write path). This way it is easier to ensure compatibility between the
+different parquet implementations.
+We cannot and don't want to stop our clients to use any features that are not
+on this list but it shall be highlighted that using these features might make
+the written parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on this list are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.b.x` it must be able to read any
+parquet files written by implementations supporting the release `a.d.y` where
+`b >= d`.
+
+If a parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested.
+
+## The "list"
+
+This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE\_ARRAY`
+* `FIXED\_LEN\_BYTE\_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
+* `UNKNOWN` **(?)**
+* `JSON` **(?)**
+* `BSON` **(?)**
+* `UUID` **(?)**
+
+NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes.
+This is because `INTERVAL` is deprecated so we do not include it in this list.
+
+### Encodings
+
+The following encodings are supported:
+* [PLAIN](Encodings.md#plain-plain--0)  
+  parquet-mr: Basically all value types are written in this encoding in case of
+  V1 pages
+* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: As per the spec this encoding is deprecated while we still use it
+  for V1 page dictionaries.
+* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)  
+  parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
+  values in case of V2 pages
+* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
+* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
+  **(?)**  
+  parquet-mr: Not used directly
+* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
+  **(?)**  
+  parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
+  FIXED\_LEN\_BYTE\_ARRAY values
+* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
+  **(?)**  
+  parquet-mr: Used for V2 page dictionaries
+* [BYTE\_STREAM\_SPLIT](Encodings.md#byte-stream-split-byte_stream_split--9)
+  **(?)**  
+  parquet-mr: Not used by default; can be used only via explicit configuration
+
+NOTE: [BIT\_PACKED](Encodings.md#bit-packed-deprecated-bit_packed--4) is
+deprecated and not used directly (boolean values are encoded with this under
+PLAIN) so not included in this list.
+
+**TODO**: In parquet-mr dictionary encoding is not enabled for
+FIXED\_LEN\_BYTE\_ARRAY in case of writing V1 pages. I don't know the reason
+behind. Any experience/idea about this from other implementations?

Review comment:
       I believe Arrow/Parquet CPP also does dictionary encoding here (would need to double check).
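
For reference while discussing the encodings list in the hunk above, the encodings and their numeric ids can be written out as an enum. The ids are the ones embedded in the anchor names of the quoted `Encodings.md` links (for example `plain-plain--0`). The `Encoding` class and `candidate_core_encodings` helper are a hypothetical sketch, and the candidate set simply mirrors the draft list, most of which is still marked **(?)**.

```python
from enum import IntEnum
from typing import List


class Encoding(IntEnum):
    """Encoding ids as they appear in the quoted Encodings.md anchors."""

    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3
    BIT_PACKED = 4
    DELTA_BINARY_PACKED = 5
    DELTA_LENGTH_BYTE_ARRAY = 6
    DELTA_BYTE_ARRAY = 7
    RLE_DICTIONARY = 8
    BYTE_STREAM_SPLIT = 9


# The draft's NOTE excludes BIT_PACKED from the list because it is deprecated.
EXCLUDED_FROM_DRAFT = frozenset({Encoding.BIT_PACKED})


def candidate_core_encodings() -> List[Encoding]:
    """Return the encodings named in the draft list, in id order."""
    return [e for e in Encoding if e not in EXCLUDED_FROM_DRAFT]
```

Keeping the exclusions in a separate set makes it easy to adjust the sketch if the discussion drops further entries, such as the deprecated `PLAIN_DICTIONARY`.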





[GitHub] [parquet-format] nevi-me commented on pull request #164: PARQUET-1950: Define core features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-756627155


   > For the record, we started [documenting](https://github.com/apache/arrow/blob/master/docs/source/cpp/parquet.rst#supported-parquet-features) the features supported by parquet-cpp (which is part of the Arrow codebase).
   
   I've opened https://issues.apache.org/jira/browse/ARROW-11181 so we can do the same for the Rust implementation

