Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/29 12:33:00 UTC
[jira] [Created] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment
Joris Van den Bossche created ARROW-10131:
---------------------------------------------
Summary: [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment
Key: ARROW-10131
URL: https://issues.apache.org/jira/browse/ARROW-10131
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Related to ARROW-9730: parsing the statistics in Parquet metadata is expensive, and should therefore be avoided when possible.
For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in Python) parses the statistics of all files and all columns. But when doing a filtered read, you might only need the statistics of certain files (e.g. if a filter on a partition field has already excluded many files) and certain columns (e.g. only the columns you are actually filtering on).
The current API is all-or-nothing: both {{ParquetDatasetFactory}} and a later {{EnsureCompleteMetadata}} call parse all statistics, and neither allows parsing only a subset, or only the other (non-statistics) parts of the metadata. So I think we should try to come up with better abstractions.
cc [~rjzamora] [~bkietz]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)