Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/03/10 18:16:00 UTC

[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

Joris Van den Bossche created ARROW-8062:
--------------------------------------------

             Summary: [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
                 Key: ARROW-8062
                 URL: https://issues.apache.org/jira/browse/ARROW-8062
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++ - Dataset, Python
            Reporter: Joris Van den Bossche


Partitioned parquet datasets sometimes come with {{_metadata}} / {{_common_metadata}} files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for {{_metadata}}).

Using those files when creating a parquet {{Dataset}} can make the factory more efficient: it can use the stored schema instead of inferring one by unifying the schemas of all files, and it can use the stored paths to the individual parquet files instead of crawling the directory.

Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.
Such logic could be put in a different factory class, e.g. {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
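
To make the idea concrete, here is a rough Python sketch of the path and partition part of such a factory. It is only an illustration under assumptions and not a proposed implementation: the names {{parse_partition_keys}} and {{build_manifest}} are hypothetical, the input list stands in for the per-row-group file paths that would be read from {{_metadata}}, and hive-style {{key=value}} directory names are assumed (the issue itself does not prescribe a partitioning scheme).

```python
from typing import Dict, Iterable, List, Tuple


def parse_partition_keys(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a relative file path.

    Assumes hive-style partitioning; values are kept as strings.
    """
    keys: Dict[str, str] = {}
    # Every segment except the last one (the file name) is a directory.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def build_manifest(row_group_paths: Iterable[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Collapse per-row-group paths into unique files with partition keys.

    A file with several row groups appears several times in the input,
    so duplicates are dropped while keeping first-seen order.
    """
    manifest: Dict[str, Dict[str, str]] = {}
    for path in row_group_paths:
        if path not in manifest:
            manifest[path] = parse_partition_keys(path)
    return list(manifest.items())


# Hypothetical paths, standing in for what a _metadata file would record.
row_group_paths = [
    "year=2019/month=12/part-0.parquet",
    "year=2019/month=12/part-0.parquet",  # second row group of the same file
    "year=2020/month=1/part-0.parquet",
]
manifest = build_manifest(row_group_paths)
```

In this sketch no directory listing is needed: the set of files and their partition keys come entirely from the recorded paths, and the schema would be taken from the same metadata file.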



--
This message was sent by Atlassian Jira
(v8.3.4#803005)