Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/04 17:32:33 UTC

[GitHub] [arrow] fsaintjacques commented on a change in pull request #7348: ARROW-3154: [Python] Expand documentation on Parquet metadata inspection and writing of _metadata

fsaintjacques commented on a change in pull request #7348:
URL: https://github.com/apache/arrow/pull/7348#discussion_r435427671



##########
File path: docs/source/python/parquet.rst
##########
@@ -350,6 +395,73 @@ Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
 will then be used by HIVE then partition column values must be compatible with
 the allowed character set of the HIVE version you are running.
 
+Writing ``_metadata`` and ``_common_metadata`` files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some processing frameworks such as Spark or Dask (optionally) use ``metadata``

Review comment:
       ```suggestion
   Some processing frameworks such as Spark or Dask (optionally) use ``_metadata``
   ```

##########
File path: docs/source/python/parquet.rst
##########
@@ -350,6 +395,73 @@ Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
 will then be used by HIVE then partition column values must be compatible with
 the allowed character set of the HIVE version you are running.
 
+Writing ``_metadata`` and ``_common_metadata`` files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some processing frameworks such as Spark or Dask (optionally) use ``metadata``
+and ``_common_metadata`` files with partitioned datasets.
+
+Those files include information about the schema of the full dataset (for
+``_common_metadata``) and potentially all row group metadata of all files in the
+partitioned dataset as well (for ``_metadata``). The actual files are
+metadata-only Parquet files. Note this is not a Parquet standard, but a
+convention set in practice by those frameworks.
+
+Using those files can make creating a Parquet dataset more efficient, since it
+can use the stored schema and file paths of all row groups,
+instead of inferring the schema and crawling the directories for all Parquet
+files (this is especially the case for filesystems where listing files

Review comment:
       ```suggestion
   files (this is especially the case for filesystems where accessing files
   ```
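
As an aside on the paragraph quoted above: both ``_common_metadata`` and
``_metadata`` are ordinary metadata-only Parquet files, so they can be
inspected directly with pyarrow. A minimal sketch, assuming a hypothetical
``dataset/`` root in which both files already exist:

```python
import pyarrow.parquet as pq

# Schema of the full dataset, as stored in _common_metadata
schema = pq.read_schema('dataset/_common_metadata')

# Combined row group metadata of all files, as stored in _metadata
metadata = pq.read_metadata('dataset/_metadata')
print(metadata.num_row_groups, metadata.num_rows)
```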

##########
File path: docs/source/python/parquet.rst
##########
@@ -350,6 +395,73 @@ Compatibility Note: if using ``pq.write_to_dataset`` to create a table that
 will then be used by HIVE then partition column values must be compatible with
 the allowed character set of the HIVE version you are running.
 
+Writing ``_metadata`` and ``_common_metadata`` files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some processing frameworks such as Spark or Dask (optionally) use ``metadata``
+and ``_common_metadata`` files with partitioned datasets.
+
+Those files include information about the schema of the full dataset (for
+``_common_metadata``) and potentially all row group metadata of all files in the
+partitioned dataset as well (for ``_metadata``). The actual files are
+metadata-only Parquet files. Note this is not a Parquet standard, but a
+convention set in practice by those frameworks.
+
+Using those files can make creating a Parquet dataset more efficient, since it
+can use the stored schema and file paths of all row groups,
+instead of inferring the schema and crawling the directories for all Parquet
+files (this is especially the case for filesystems where listing files
+is expensive).
+
+The :func:`~pyarrow.parquet.write_to_dataset` function does not automatically
+write such metadata files, but you can use it to gather the metadata and
+combine and write them manually:
+
+.. code-block:: python
+
+   # Write a dataset and collect metadata information of all written files
+   metadata_collector = []
+   pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)
+
+   # Write the ``_common_metadata`` parquet file without row group statistics
+   pq.write_metadata(table.schema, root_path / '_common_metadata')
+
+   # Write the ``_metadata`` parquet file with row group statistics of all files
+   pq.write_metadata(
+       table.schema, root_path / '_metadata',
+       metadata_collector=metadata_collector
+   )
+
+When not using the :func:`~pyarrow.parquet.write_to_dataset` function, but
+writing the individual files of the partitioned dataset using
+:func:`~pyarrow.parquet.write_table` or :class:`~pyarrow.parquet.ParquetWriter`,
+the ``metadata_collector`` keyword can also be used to collect the FileMetaData
+of the written files. In this case, you need to make sure to set the file path
+contained in the row group metadata yourself before combining the metadata:

Review comment:
       Add a warning about the schema equality requirement.
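
To make the suggested warning concrete, here is a minimal sketch of the
workflow described in the quoted paragraph: collect ``FileMetaData`` with
``write_table``, set the file path of each piece of metadata, then combine.
The ``dataset/`` root, the ``year=2020`` file name and the example table are
assumptions for illustration; combining only works if all collected
``FileMetaData`` objects share the same schema (the warning requested above).

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Example table and partition directory (hypothetical paths)
table = pa.table({'n_legs': [2, 4], 'animal': ['parrot', 'dog']})
os.makedirs('dataset/year=2020', exist_ok=True)

# Write one file of the partitioned dataset and collect its FileMetaData
metadata_collector = []
pq.write_table(
    table, 'dataset/year=2020/data1.parquet',
    metadata_collector=metadata_collector
)

# Set the file path (relative to the dataset root) in the collected metadata
metadata_collector[-1].set_file_path('year=2020/data1.parquet')

# Combine the collected metadata and write the _metadata file; this assumes
# all collected FileMetaData objects have the same schema
pq.write_metadata(
    table.schema, 'dataset/_metadata',
    metadata_collector=metadata_collector
)
```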



