Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/16 07:34:47 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8928: ARROW-10918: [Doc][C++] Document supported Parquet features

jorisvandenbossche commented on a change in pull request #8928:
URL: https://github.com/apache/arrow/pull/8928#discussion_r544071122



##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+

Review comment:
       Would it make sense (to the extent we know) to also list the features we do not support (not in the table, but e.g. in a sentence after the table)?
   
   For example, for the encodings, are there any that we do not support?

##########
File path: docs/source/cpp/parquet.rst
##########
@@ -27,15 +27,207 @@ Reading and writing Parquet files
 .. seealso::
    :ref:`Parquet reader and writer API reference <cpp-api-parquet>`.
 
-The Parquet C++ library is part of the Apache Arrow project and benefits
-from tight integration with Arrow C++.
+The `Parquet format <https://parquet.apache.org/documentation/latest/>`__
+is a space-efficient columnar storage format for complex data.  The Parquet
+C++ implementation is part of the Apache Arrow project and benefits
+from tight integration with the Arrow C++ classes and facilities.
+
+Supported Parquet features
+==========================
+
+The Parquet format has many features, and Parquet C++ supports a subset of them.
+
+Page types
+----------
+
++-------------------+---------+
+| Page type         | Notes   |
++===================+=========+
+| DATA_PAGE         |         |
++-------------------+---------+
+| DATA_PAGE_V2      |         |
++-------------------+---------+
+| DICTIONARY_PAGE   |         |
++-------------------+---------+
+
+Compression
+-----------
+
++-------------------+---------+
+| Compression codec | Notes   |
++===================+=========+
+| SNAPPY            |         |
++-------------------+---------+
+| GZIP              |         |
++-------------------+---------+
+| BROTLI            |         |
++-------------------+---------+
+| LZ4               | \(1)    |
++-------------------+---------+
+| ZSTD              |         |
++-------------------+---------+
+
+* \(1) On the read side, Parquet C++ is able to decompress both the regular
+  LZ4 block format and the ad-hoc Hadoop LZ4 format used by the
+  `reference Parquet implementation <https://github.com/apache/parquet-mr>`__.
+  On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
+
+Encodings
+---------
+
++--------------------------+---------+
+| Encoding                 | Notes   |
++==========================+=========+
+| PLAIN                    |         |
++--------------------------+---------+
+| PLAIN_DICTIONARY         |         |
++--------------------------+---------+
+| BIT_PACKED               |         |
++--------------------------+---------+
+| RLE                      | \(1)    |
++--------------------------+---------+
+| RLE_DICTIONARY           | \(2)    |
++--------------------------+---------+
+| BYTE_STREAM_SPLIT        |         |
++--------------------------+---------+
+
+* \(1) Supported only for encoding definition and repetition levels, not for values.
+
+* \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version
+  2.0 or greater is selected in :func:`WriterProperties::version`.
+
+Types
+-----
+
+Physical types
+~~~~~~~~~~~~~~
+
++--------------------------+-------------------------+------------+
+| Physical type            | Mapped Arrow type       | Notes      |
++==========================+=========================+============+
+| BOOLEAN                  | Boolean                 |            |
++--------------------------+-------------------------+------------+
+| INT32                    | Int32 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT64                    | Int64 / other           | \(1)       |
++--------------------------+-------------------------+------------+
+| INT96                    | Timestamp (nanoseconds) | \(2)       |
++--------------------------+-------------------------+------------+
+| FLOAT                    | Float32                 |            |
++--------------------------+-------------------------+------------+
+| DOUBLE                   | Float64                 |            |
++--------------------------+-------------------------+------------+
+| BYTE_ARRAY               | Binary / other          | \(1) \(3)  |
++--------------------------+-------------------------+------------+
+| FIXED_LENGTH_BYTE_ARRAY  | FixedSizeBinary / other | \(1)       |
++--------------------------+-------------------------+------------+
+
+* \(1) Can be mapped to other Arrow types, depending on the logical type
+  (see below).
+
+* \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps`
+  must be enabled.
+
+* \(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.
+
+Logical types
+~~~~~~~~~~~~~
+
+Specific logical types can override the default Arrow type mapping for a given
+physical type.  If the Parquet file contains an unrecognized logical type,
+the default physical type mapping is used.

Review comment:
       An example of this is JSON/BSON?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org