Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/15 16:33:47 UTC

[GitHub] [arrow] pitrou opened a new pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

pitrou opened a new pull request #8928:
URL: https://github.com/apache/arrow/pull/8928


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#issuecomment-745411030


   @emkornfield I'm not sure everything is covered adequately. I would welcome your validation on this.



[GitHub] [arrow] github-actions[bot] commented on pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#issuecomment-745420463


   https://issues.apache.org/jira/browse/ARROW-10918



[GitHub] [arrow] emkornfield commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544082188



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(4)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        |         |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
+
+* \(3) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING.
+
+* \(4) On the write side, an Arrow LargeList or FixedSizeList is also mapped to

Review comment:
       It might be worth noting that, on the read side, Arrow will not convert to Large types unless the file was written with Arrow metadata.





[GitHub] [arrow] emkornfield commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544081781



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |

Review comment:
       Decimal256 is only used if the original type written from Arrow is Decimal256 or the precision exceeds what Decimal128 can hold.  Arrow only writes FIXED_LENGTH_BYTE_ARRAY (FLBA) for decimals.
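
The precision limit mentioned above can be illustrated with a small sketch. It assumes decimals are stored as signed two's-complement integers of a fixed byte width (16 bytes for Decimal128, 32 bytes for Decimal256). The helper names `max_decimal_precision` and `arrow_decimal_for_precision` are hypothetical, not part of the Arrow or Parquet APIs, and the sketch covers only the precision rule, not the case where the original Arrow type is recorded in the file metadata.

```python
def max_decimal_precision(byte_width: int) -> int:
    """Largest number of decimal digits that always fits in a signed
    two's-complement integer of the given byte width."""
    max_value = 2 ** (8 * byte_width - 1) - 1
    precision = 0
    # Grow the precision while the largest value with one more digit
    # (10**(precision + 1) - 1) still fits.
    while 10 ** (precision + 1) - 1 <= max_value:
        precision += 1
    return precision


def arrow_decimal_for_precision(precision: int) -> str:
    """Hypothetical helper: pick the narrowest Arrow decimal type that
    can hold the given precision (assumption: precision rule only)."""
    if precision <= max_decimal_precision(16):
        return "Decimal128"
    if precision <= max_decimal_precision(32):
        return "Decimal256"
    raise ValueError(f"precision {precision} exceeds Decimal256")


if __name__ == "__main__":
    for width in (4, 8, 16, 32):
        print(width, "bytes ->", max_decimal_precision(width), "digits")
```

Under these assumptions, a DECIMAL column with precision 38 or less would map to Decimal128, and anything larger would need Decimal256.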





[GitHub] [arrow] pitrou closed pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou closed pull request #8928:
URL: https://github.com/apache/arrow/pull/8928


   



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544071122



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+

Review comment:
       Would it make sense (to the extent we know) to also list the features we do not support (not in the table, but e.g. in a sentence after the table)?
   
   For example, are there encodings that we do not support?

##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.

Review comment:
       Is JSON/BSON an example of this?





[GitHub] [arrow] nevi-me commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543504470



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       Wouldn't it be better to map it to a Parquet timestamp? I'm assuming that some resolution is lost when mapping to INT32.

##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |

Review comment:
       Do you support `LargeUtf8`?





[GitHub] [arrow] nevi-me commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543539734



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       Thanks, also opened ARROW-10925





[GitHub] [arrow] pitrou commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544184629



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+

Review comment:
       Yep, it could.

##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.

Review comment:
       Right.





[GitHub] [arrow] pitrou commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543516189



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       While Arrow Date64 is represented as a number of milliseconds since the Unix epoch, it logically only represents entire days. From the Schema flatbuffers file:
   ```
   /// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX
   /// epoch (1970-01-01), stored in either of two units:
   ///
   /// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
   ///   leap seconds), where the values are evenly divisible by 86400000
   /// * Days (32 bits) since the UNIX epoch
   table Date {
     unit: DateUnit = MILLISECOND;
   }
   ```
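
To make the resolution point concrete, here is a minimal sketch of the arithmetic involved. It is not Arrow's or Parquet C++'s actual conversion code; `Date64ToDate32` and `kMillisPerDay` are hypothetical names used only for illustration. It assumes, per the Schema comment quoted above, that a valid Date64 value is a whole multiple of 86400000 milliseconds, in which case dividing by that constant loses nothing.

```cpp
#include <cstdint>
#include <limits>

// Milliseconds in one day: the divisor named in the Schema comment above.
constexpr int64_t kMillisPerDay = 86400000;

// Hypothetical helper (not an Arrow API): converts a Date64 value
// (milliseconds since the UNIX epoch) to a Date32 value (days since the
// UNIX epoch). Returns false, leaving *out untouched, if the input is not a
// whole number of days or the day count does not fit in 32 bits.
bool Date64ToDate32(int64_t millis, int32_t* out) {
  if (millis % kMillisPerDay != 0) {
    return false;  // not evenly divisible, so not a valid Date64 per the spec
  }
  const int64_t days = millis / kMillisPerDay;
  if (days < std::numeric_limits<int32_t>::min() ||
      days > std::numeric_limits<int32_t>::max()) {
    return false;  // day count out of range for Date32
  }
  *out = static_cast<int32_t>(days);
  return true;
}
```

Under that assumption the Date64-to-DATE mapping is exact; only values that already violate the "evenly divisible" rule would carry sub-day information that a day count cannot hold.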





[GitHub] [arrow] nevi-me commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543521956



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       Ah, I read that today, but it went over my head. I'll change #8926. Do/should we enforce this when creating a Date64 array, by checking if the value is divisible by 86400000?





[GitHub] [arrow] pitrou commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543532965



##########
File path: docs/source/cpp/parquet.rst
##########
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       You could have an option to check that. I don't think it's mandatory to do it by default.
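
   A rough sketch of what such an opt-in check might look like, assuming a plain vector of Date64 values. `Date64CheckOptions` and `FindFirstInvalidDate64` are hypothetical names used for illustration only; they do not exist in Arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical options struct; the check is off by default, as suggested above.
struct Date64CheckOptions {
  bool check_whole_days = false;
};

// Returns the index of the first value that is not a multiple of 86400000 ms,
// or -1 if the check is disabled or every value passes.
inline int64_t FindFirstInvalidDate64(const std::vector<int64_t>& values,
                                      const Date64CheckOptions& options) {
  const int64_t kMillisPerDay = 86400000;
  if (!options.check_whole_days) {
    return -1;
  }
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (values[i] % kMillisPerDay != 0) {
      return static_cast<int64_t>(i);
    }
  }
  return -1;
}
```

   Keeping the flag off by default means callers who already trust their data pay nothing for the extra pass over the values.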





[GitHub] [arrow] pitrou commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543535005



##########
File path: docs/source/cpp/parquet.rst
##########
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       (related: ARROW-10924)





[GitHub] [arrow] pitrou commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543514265



##########
File path: docs/source/cpp/parquet.rst
##########
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |

Review comment:
       Yes, it is. So is `LargeBinary`. I'll add a note, thanks.





[GitHub] [arrow] nevi-me commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r543521956



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,203 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+---------+
+| Physical type            | Mapped Arrow type       | Notes   |
++==========================+=========================+=========+
+| BOOLEAN                  | Boolean                 |         |
++--------------------------+-------------------------+---------+
+| INT32                    | Int32 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT64                    | Int64 / other           | \(1)    |
++--------------------------+-------------------------+---------+
+| INT96                    | Timestamp (nanoseconds) | \(2)    |
++--------------------------+-------------------------+---------+
+| FLOAT                    | Float32                 |         |
++--------------------------+-------------------------+---------+
+| DOUBLE                   | Float64                 |         |
++--------------------------+-------------------------+---------+
+| BYTE_ARRAY               | Binary / other          | \(1)    |
++--------------------------+-------------------------+---------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)    |
++--------------------------+-------------------------+---------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+
+* \(1) On the write side, the Parquet physical type INT32 is generated.
+
+* \(2) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

Review comment:
       Ah, I read that today, but it went over my head. I'll change #8926. Should we enforce this in Rust when creating a Date64 array, by checking that each value is divisible by 86400000 (the number of milliseconds in a day)?
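
       For illustration, here is a minimal sketch of such a check. It works on plain `i64` values rather than on the real Arrow Rust array types, and the names `MILLISECONDS_IN_DAY`, `is_whole_day` and `validate_date64` are hypothetical, not part of any existing API.

```rust
/// Number of milliseconds in one day (24 * 60 * 60 * 1000).
const MILLISECONDS_IN_DAY: i64 = 86_400_000;

/// Returns true if `ms` (milliseconds since the UNIX epoch) falls exactly
/// on a day boundary. Negative values (dates before 1970) work as well,
/// since the remainder of an exact multiple is always zero.
fn is_whole_day(ms: i64) -> bool {
    ms % MILLISECONDS_IN_DAY == 0
}

/// Checks every non-null value of a hypothetical Date64 column.
/// `None` stands for a null slot and is always accepted.
fn validate_date64(values: &[Option<i64>]) -> Result<(), String> {
    for (index, value) in values.iter().enumerate() {
        if let Some(ms) = value {
            if !is_whole_day(*ms) {
                return Err(format!(
                    "Date64 value {} at index {} is not a multiple of {}",
                    ms, index, MILLISECONDS_IN_DAY
                ));
            }
        }
    }
    Ok(())
}
```

       Whether such a check should reject the array, or only be applied when writing to Parquet, is a separate design question that this sketch does not answer.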





[GitHub] [arrow] emkornfield commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

Posted by GitBox <gi...@apache.org>.

emkornfield commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544081105



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Only supported for encoding definition and repetition levels, not values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 (or potentially greater) is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.
+
++-------------------+-----------------------------+----------------------------+---------+
+| Logical type      | Physical type               | Mapped Arrow type          | Notes   |
++===================+=============================+============================+=========+
+| NULL              | Any                         | Null                       | \(1)    |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT32                       | Int8 / UInt8 / Int16 /     |         |
+|                   |                             | UInt16 / Int32 / UInt32    |         |
++-------------------+-----------------------------+----------------------------+---------+
+| INT               | INT64                       | Int64 / UInt64             |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal128 / Decimal256    |         |
+|                   | / FIXED_LENGTH_BYTE_ARRAY   |                            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| DATE              | INT32                       | Date32                     | \(2)    |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT32                       | Time32 (milliseconds)      |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIME              | INT64                       | Time64 (micro- or          |         |
+|                   |                             | nanoseconds)               |         |
++-------------------+-----------------------------+----------------------------+---------+
+| TIMESTAMP         | INT64                       | Timestamp (milli-, micro-  |         |
+|                   |                             | or nanoseconds)            |         |
++-------------------+-----------------------------+----------------------------+---------+
+| STRING            | BYTE_ARRAY                  | Utf8                       | \(3)    |
++-------------------+-----------------------------+----------------------------+---------+
+| LIST              | Any                         | List                       | \(4)    |
++-------------------+-----------------------------+----------------------------+---------+
+| MAP               | Any                         | Map                        |         |

Review comment:
       The read side doesn't conform to the Parquet specification, because repeated elements are not deduplicated.
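
       As a rough sketch of what deduplication on read could look like: the code below assumes that the intended rule is "the last value for a given key wins", and it uses plain `(String, i64)` pairs instead of real Arrow or Parquet types. The function name `dedup_last_wins` is hypothetical.

```rust
use std::collections::HashMap;

/// Deduplicates the key-value pairs of a single map value.
/// When a key occurs several times, the last value is kept (assumed rule),
/// and the key stays at the position of its first occurrence.
fn dedup_last_wins(entries: &[(String, i64)]) -> Vec<(String, i64)> {
    // Maps each key already seen to its position in `out`.
    let mut positions: HashMap<&str, usize> = HashMap::new();
    let mut out: Vec<(String, i64)> = Vec::new();

    for (key, value) in entries {
        // Copy the position out so that `positions` is not borrowed
        // while it is updated below.
        let existing = positions.get(key.as_str()).copied();
        match existing {
            // Key seen before: overwrite the earlier value.
            Some(pos) => out[pos].1 = *value,
            // First occurrence: remember where the key was stored.
            None => {
                positions.insert(key.as_str(), out.len());
                out.push((key.clone(), *value));
            }
        }
    }
    out
}
```

       For example, the entries `("a", 1), ("b", 2), ("a", 3)` would be reduced to `("a", 3), ("b", 2)`.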



