Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/06/18 19:59:00 UTC

[jira] [Commented] (ARROW-8733) [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata

    [ https://issues.apache.org/jira/browse/ARROW-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139960#comment-17139960 ] 

Joris Van den Bossche commented on ARROW-8733:
----------------------------------------------

bq. So, do we re-use the FileMetaData taken from the `_metadata` file (if constructed that way), or do we always pay the filesystem cost whenever the user requests this metadata?

For dask's use case, I think it is quite important to be able to access the metadata (or at least the statistics) for a given fragment without paying the filesystem cost when the dataset/fragments were constructed with the ParquetFactory from a {{_metadata}} file.

And I think ideally this can be accessed from the Fragments themselves. If the fragments are filtered, there is no easy way for dask to process the {{_metadata}} file itself to gather all the statistics: it would need to repeat the filtering to get the statistics matching the filtered fragments. That is not going to be very robust if there are small variations in the filtering logic, and it would duplicate that logic.
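
To make the argument concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It is a toy model, not the Arrow API: {{Fragment}}, {{fragments_from_metadata}}, {{filter_fragments}} and {{read_footer}} are hypothetical names, and the statistics are reduced to a (min, max) pair per column. The point is only that fragments built from a {{_metadata}} file carry their statistics with them, so a consumer can filter first and still read the matching statistics without touching the filesystem.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for illustration only; none of these are pyarrow names.
Stats = Dict[str, Tuple[int, int]]  # column name -> (min, max)


@dataclass
class Fragment:
    path: str
    cached_stats: Optional[Stats] = None  # set when built from a _metadata file

    def stats(self, read_footer: Callable[[str], Stats]) -> Stats:
        # Reuse the statistics carried over from _metadata if present;
        # otherwise pay the filesystem cost of reading the file's own footer.
        if self.cached_stats is None:
            self.cached_stats = read_footer(self.path)
        return self.cached_stats


def fragments_from_metadata(global_metadata: Dict[str, Stats]) -> List[Fragment]:
    # One fragment per file listed in the global metadata, statistics attached.
    return [Fragment(path, dict(stats)) for path, stats in global_metadata.items()]


def filter_fragments(
    fragments: List[Fragment], column: str, lo: int, hi: int
) -> List[Fragment]:
    # Keep fragments whose [min, max] range for `column` overlaps [lo, hi].
    kept = []
    for frag in fragments:
        fmin, fmax = frag.cached_stats[column]
        if fmax >= lo and fmin <= hi:
            kept.append(frag)
    return kept


footer_reads: List[str] = []


def read_footer(path: str) -> Stats:
    # Placeholder for a real footer read; records that the filesystem was hit.
    footer_reads.append(path)
    return {"x": (0, 0)}


global_metadata = {
    "part-0.parquet": {"x": (0, 9)},
    "part-1.parquet": {"x": (10, 19)},
    "part-2.parquet": {"x": (20, 29)},
}

fragments = fragments_from_metadata(global_metadata)
selected = filter_fragments(fragments, "x", 12, 25)

# The consumer reads statistics from the already-filtered fragments and never
# has to re-apply the filter against the _metadata contents itself.
selected_stats = {frag.path: frag.stats(read_footer) for frag in selected}
```

In this sketch {{footer_reads}} stays empty for the fragments created from the global metadata, while a fragment constructed from a bare path would fall back to {{read_footer}} on first access.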



> [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-8733
>                 URL: https://issues.apache.org/jira/browse/ARROW-8733
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration
>
> Related to ARROW-8062 (as there we will also need a way to expose the global FileMetadata). But independently, it would be useful to get access to the FileMetadata on each {{ParquetFileFragment}} (e.g. to get access to the statistics).
> This would be relatively simple to code on the Python/R side, since we have access to the file path, and could read the metadata from the file backing the fragment, and return this as a FileMetadata object.
> I am wondering if we want to integrate this with ARROW-8062: when the fragments were created from a {{_metadata}} file, a {{ParquetFileFragment.metadata}} attribute would not need to read the metadata from the parquet file, but could take it from the global metadata (at least for e.g. the row group data).
> Another question: what about a ParquetFileFragment that maps to a single row group?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)