Posted to commits@arrow.apache.org by jo...@apache.org on 2020/06/10 11:24:53 UTC

[arrow] branch master updated: ARROW-3154: [Python] Expand documentation on Parquet metadata inspection and writing of _metadata

This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 0b7072f  ARROW-3154: [Python] Expand documentation on Parquet metadata inspection and writing of _metadata
0b7072f is described below

commit 0b7072fdfaea640fa29c6e8553525a34172ad3a6
Author: Joris Van den Bossche <jo...@gmail.com>
AuthorDate: Wed Jun 10 13:24:29 2020 +0200

    ARROW-3154: [Python] Expand documentation on Parquet metadata inspection and writing of _metadata
    
    In addition to https://issues.apache.org/jira/browse/ARROW-3154, this also closes https://issues.apache.org/jira/browse/ARROW-3275
    
    And while going through the parquet docs, also clarified the `coerce_timestamps` default.
    
    Closes #7348 from jorisvandenbossche/ARROW-3154-parquet-metadata-docs
    
    Authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 docs/source/python/parquet.rst | 120 +++++++++++++++++++++++++++++++++++++++--
 python/pyarrow/parquet.py      |   8 ++-
 2 files changed, 124 insertions(+), 4 deletions(-)

diff --git a/docs/source/python/parquet.rst b/docs/source/python/parquet.rst
index fb1a10b..3451e85 100644
--- a/docs/source/python/parquet.rst
+++ b/docs/source/python/parquet.rst
@@ -56,7 +56,7 @@ Reading and Writing Single Files
 --------------------------------
 
 The functions :func:`~.parquet.read_table` and :func:`~.parquet.write_table`
-read and write the :ref:`pyarrow.Table <data.table>` objects, respectively.
+read and write the :ref:`pyarrow.Table <data.table>` object, respectively.
 
 Let's look at a simple table:
 
@@ -126,6 +126,8 @@ control various settings when writing a Parquet file.
 * ``flavor``, to set compatibility options particular to a Parquet
   consumer like ``'spark'`` for Apache Spark.
 
+See the :func:`~pyarrow.parquet.write_table()` docstring for more details.
+
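+For example, a few of these options can be combined in a single call (a
+minimal sketch, reusing the ``table`` object from the example above):
+
+.. code-block:: python
+
+   # write with snappy compression, dictionary encoding and Spark-compatible
+   # type handling
+   pq.write_table(table, 'example.parquet',
+                  compression='snappy',
+                  use_dictionary=True,
+                  flavor='spark')
+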
 There are some additional data type handling-specific options
 described below.
 
@@ -199,6 +201,33 @@ Alternatively python ``with`` syntax can also be use:
        for i in range(3):
            writer.write_table(table)
 
+Inspecting the Parquet File Metadata
+------------------------------------
+
+The ``FileMetaData`` of a Parquet file can be accessed through
+:class:`~.ParquetFile` as shown above:
+
+.. ipython:: python
+
+   parquet_file = pq.ParquetFile('example.parquet')
+   metadata = parquet_file.metadata
+
+or can also be read directly using :func:`~parquet.read_metadata`:
+
+.. ipython:: python
+
+   metadata = pq.read_metadata('example.parquet')
+   metadata
+
+The returned ``FileMetaData`` object allows you to inspect the
+`Parquet file metadata <https://github.com/apache/parquet-format#metadata>`__,
+such as the row groups and column chunk metadata and statistics:
+
+.. ipython:: python
+
+   metadata.row_group(0)
+   metadata.row_group(0).column(0)
+
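+As a rough sketch of how these objects can be combined, the following loops
+over all row groups and column chunks to print their statistics (reusing the
+``metadata`` object read above; statistics can be absent if the writer did
+not store them):
+
+.. code-block:: python
+
+   for i in range(metadata.num_row_groups):
+       row_group = metadata.row_group(i)
+       for j in range(row_group.num_columns):
+           column = row_group.column(j)
+           stats = column.statistics
+           if stats is not None and stats.has_min_max:
+               print(column.path_in_schema, stats.min, stats.max)
+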
 .. ipython:: python
    :suppress:
 
@@ -228,8 +257,12 @@ Storing timestamps
 
 Some Parquet readers may only support timestamps stored in millisecond
 (``'ms'``) or microsecond (``'us'``) resolution. Since pandas uses nanoseconds
-to represent timestamps, this can occasionally be a nuisance. We provide the
-``coerce_timestamps`` option to allow you to select the desired resolution:
+to represent timestamps, this can occasionally be a nuisance. By default
+(when writing version 1.0 Parquet files), the nanoseconds will be cast to
+microseconds ('us').
+
+In addition, we provide the ``coerce_timestamps`` option to allow you to select
+the desired resolution:
 
 .. code-block:: python
 
@@ -244,6 +277,18 @@ an exception will be raised. This can be suppressed by passing
    pq.write_table(table, where, coerce_timestamps='ms',
                   allow_truncated_timestamps=True)
 
+Timestamps with nanoseconds can be stored without casting when using the
+more recent Parquet format version 2.0:
+
+.. code-block:: python
+
+   pq.write_table(table, where, version='2.0')
+
+However, many Parquet readers do not yet support this newer format version, and
+therefore the default is to write version 1.0 files. When compatibility across
+different processing frameworks is required, it is recommended to use the
+default version 1.0.
+
 Older Parquet implementations use ``INT96`` based storage of
 timestamps, but this is now deprecated. This includes some older
 versions of Apache Impala and Apache Spark. To write timestamps in
@@ -350,6 +395,75 @@ Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
 will then be used by HIVE then partition column values must be compatible with
 the allowed character set of the HIVE version you are running.
 
+Writing ``_metadata`` and ``_common_metadata`` files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some processing frameworks such as Spark or Dask (optionally) use ``_metadata``
+and ``_common_metadata`` files with partitioned datasets.
+
+Those files include information about the schema of the full dataset (for
+``_common_metadata``) and potentially all row group metadata of all files in the
+partitioned dataset as well (for ``_metadata``). The actual files are
+metadata-only Parquet files. Note this is not a Parquet standard, but a
+convention set in practice by those frameworks.
+
+Using those files can make the creation of a Parquet dataset more efficient,
+since the stored schema and file paths of all row groups can be used,
+instead of inferring the schema and crawling the directories for all Parquet
+files (this is especially the case for filesystems where accessing files
+is expensive).
+
+The :func:`~pyarrow.parquet.write_to_dataset` function does not automatically
+write such metadata files, but you can use it to gather the metadata and then
+combine and write them manually:
+
+.. code-block:: python
+
+   # Write a dataset and collect metadata information of all written files
+   metadata_collector = []
+   pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)
+
+   # Write the ``_common_metadata`` parquet file without row group statistics
+   pq.write_metadata(table.schema, root_path / '_common_metadata')
+
+   # Write the ``_metadata`` parquet file with row group statistics of all files
+   pq.write_metadata(
+       table.schema, root_path / '_metadata',
+       metadata_collector=metadata_collector
+   )
+
+When not using the :func:`~pyarrow.parquet.write_to_dataset` function, but
+writing the individual files of the partitioned dataset using
+:func:`~pyarrow.parquet.write_table` or :class:`~pyarrow.parquet.ParquetWriter`,
+the ``metadata_collector`` keyword can also be used to collect the FileMetaData
+of the written files. In this case, you need to set the file path contained
+in the row group metadata yourself before combining the metadata, and the
+schemas of all the different files and collected FileMetaData objects should
+be the same:
+
+.. code-block:: python
+
+   metadata_collector = []
+   pq.write_table(
+       table1, root_path / "year=2017/data1.parquet",
+       metadata_collector=metadata_collector
+   )
+
+   # set the file path relative to the root of the partitioned dataset
+   metadata_collector[-1].set_file_path("year=2017/data1.parquet")
+
+   # combine and write the metadata
+   metadata = metadata_collector[0]
+   for _meta in metadata_collector[1:]:
+       metadata.append_row_groups(_meta)
+   metadata.write_metadata_file(root_path / "_metadata")
+
+   # or use pq.write_metadata to combine and write in a single step
+   pq.write_metadata(
+       table1.schema, root_path / "_metadata",
+       metadata_collector=metadata_collector
+   )
+
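+The combined metadata written this way can be read back with
+:func:`~pyarrow.parquet.read_metadata`, for example to check that it indeed
+contains the row groups of all written files (a small sketch, reusing the
+``root_path`` from above):
+
+.. code-block:: python
+
+   combined_metadata = pq.read_metadata(root_path / "_metadata")
+   print(combined_metadata.num_row_groups)
+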
 Reading from Partitioned Datasets
 ------------------------------------------------
 
diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 1059a40..f2f4ac1 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -439,7 +439,13 @@ use_deprecated_int96_timestamps : bool, default None
     Write timestamps to INT96 Parquet format. Defaults to False unless enabled
     by flavor argument. This take priority over the coerce_timestamps option.
 coerce_timestamps : str, default None
-    Cast timestamps a particular resolution.
+    Cast timestamps to a particular resolution. The default depends on
+    `version`. For ``version='1.0'`` (the default), nanoseconds are cast to
+    microseconds ('us') and seconds to milliseconds ('ms') by default. For
+    ``version='2.0'``, the original resolution is preserved and no casting
+    is done by default. The casting might result in loss of data, in which
+    case ``allow_truncated_timestamps=True`` can be used to suppress the
+    raised exception.
     Valid values: {None, 'ms', 'us'}
 data_page_size : int, default None
     Set a target threshold for the approximate encoded size of data