Posted to issues@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2020/05/25 19:21:00 UTC
[jira] [Resolved] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
[ https://issues.apache.org/jira/browse/ARROW-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francois Saint-Jacques resolved ARROW-8062.
-------------------------------------------
Resolution: Fixed
Issue resolved by pull request 7180
[https://github.com/apache/arrow/pull/7180]
> [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
> -----------------------------------------------------------------------------
>
> Key: ARROW-8062
> URL: https://issues.apache.org/jira/browse/ARROW-8062
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Partitioned parquet datasets sometimes come with {{_metadata}} / {{_common_metadata}} files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for {{_metadata}}).
> Using those files during the creation of a parquet {{Dataset}} can give a more efficient factory: the stored schema can be used instead of inferring a schema by unioning the schemas of all files, and the stored paths to the individual parquet files avoid crawling the directory.
> Basically, based on those files, the schema, list of paths, and partition expressions (the information needed to create a Dataset) could be constructed.
> Such logic could be put in a different factory class, e.g. {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)