
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/25 20:08:00 UTC

[jira] [Commented] (ARROW-10100) [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

    [ https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202405#comment-17202405 ] 

Joris Van den Bossche commented on ARROW-10100:
-----------------------------------------------

Following the discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in its Parquet reader), it might be useful to be able to "subset" a ParquetFileFragment, or read only part of it, for a specific set of row group ids.

Use cases:

* Read only a given set of row group ids (similar to {{ParquetFile.read_row_groups}}), e.g. because you want to control the size of the resulting table by reading subsets of row groups
* Get a ParquetFileFragment with a subset of row groups (e.g. based on a filter), for example to then get the statistics of only those row groups

The first case could, for example, be solved by adding a {{row_groups}} keyword to {{ParquetFileFragment.to_table}} (but this would then be a keyword specific to the Parquet format, and we should probably also add it to {{scan}} et al).

The second case is something you can in principle do manually by recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., row_groups=[...])}}. However, this is a) a bit cumbersome and b) the statistics might need to be parsed again?
The statistics of a set of filtered row groups could also be obtained by using {{split_by_row_group(filter)}} (and then getting the statistics of each of the resulting fragments), but if you then want a single fragment, you need to recreate one from the obtained row group ids.

So one idea I have now (mostly brainstorming here): would it be useful to have a method to create a "subsetted" ParquetFileFragment, based either on a list of row group ids ({{fragment.subset(row_groups=[...])}}) or on a filter ({{fragment.subset(filter=...)}}, which would be equivalent to {{split_by_row_group}} followed by recombining into a single fragment)?

cc [~bkietz] [~rjzamora]



> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10100
>                 URL: https://issues.apache.org/jira/browse/ARROW-10100
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Joris Van den Bossche
>            Priority: Major
>



