Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/25 20:08:00 UTC

[jira] [Issue Comment Deleted] (ARROW-10100) [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

     [ https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-10100:
------------------------------------------
    Comment: was deleted

(was: Following the discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in its Parquet reader), it might be useful to be able to "subset" a ParquetFileFragment, or read only part of it, for a specific set of row group ids.

Use cases:

* Read only a given set of row group ids (similar to {{ParquetFile.read_row_groups}}), e.g. because you want to control the size of the resulting table by reading subsets of row groups
* Get a ParquetFileFragment with a subset of row groups (e.g. based on a filter), for example to then get the statistics of only those row groups

The first case could, for example, be solved by adding a {{row_groups}} keyword to {{ParquetFileFragment.to_table}} (but this keyword would then be specific to the Parquet format, and we should then probably also add it to {{scan}} et al.).

The second case is something you can in principle do manually by recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., row_groups=[...])}}. However, this is a) a bit cumbersome, and b) the statistics might need to be parsed again?
The statistics of a set of filtered row groups could also be obtained by using {{split_by_row_group(filter)}} (and then getting the statistics of each of the fragments), but if you then want a single fragment, you still need to recreate one from the obtained row group ids.

So one idea I have now (mostly brainstorming here): would it be useful to have a method that creates a "subsetted" ParquetFileFragment, based either on a list of row group ids ({{fragment.subset(row_groups=[...])}}) or on a filter ({{fragment.subset(filter=...)}}, which would be equivalent to {{split_by_row_group}} followed by recombining the result into a single fragment)?

cc [~bkietz] [~rjzamora]

)

> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10100
>                 URL: https://issues.apache.org/jira/browse/ARROW-10100
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Following the discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in its Parquet reader), it might be useful to be able to "subset" a ParquetFileFragment, or read only part of it, for a specific set of row group ids.
> Use cases:
> * Read only a given set of row group ids (similar to {{ParquetFile.read_row_groups}}), e.g. because you want to control the size of the resulting table by reading subsets of row groups
> * Get a ParquetFileFragment with a subset of row groups (e.g. based on a filter), for example to then get the statistics of only those row groups
> The first case could, for example, be solved by adding a {{row_groups}} keyword to {{ParquetFileFragment.to_table}} (but this keyword would then be specific to the Parquet format, and we should then probably also add it to {{scan}} et al.).
> The second case is something you can in principle do manually by recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., row_groups=[...])}}. However, this is a) a bit cumbersome, and b) the statistics might need to be parsed again?
> The statistics of a set of filtered row groups could also be obtained by using {{split_by_row_group(filter)}} (and then getting the statistics of each of the fragments), but if you then want a single fragment, you still need to recreate one from the obtained row group ids.
> So one idea I have now (mostly brainstorming here): would it be useful to have a method that creates a "subsetted" ParquetFileFragment, based either on a list of row group ids ({{fragment.subset(row_groups=[...])}}) or on a filter ({{fragment.subset(filter=...)}}, which would be equivalent to {{split_by_row_group}} followed by recombining the result into a single fragment)?
> cc [~bkietz] [~rjzamora]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)