You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/09/30 20:21:00 UTC

[jira] [Assigned] (ARROW-10100) [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

     [ https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman reassigned ARROW-10100:
------------------------------------

    Assignee: Joris Van den Bossche

> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10100
>                 URL: https://issues.apache.org/jira/browse/ARROW-10100
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> From discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in their parquet reader), it might be useful to somehow "subset" or read a subset of a ParquetFileFragment for a specific set of row group ids.
> Use cases:
> * Read only a set of row groups ids (this is similar as {{ParquetFile.read_row_groups}}), eg because you want to control the size of the resulting table by reading subsets of row groups
> * Get a ParquetFileFragment with a subset of row groups (eg based on a filter) to then eg get the statistics of only those row groups
> The first case could for example be solved by adding a {{row_groups}} keyword to {{ParquetFileFragment.to_table}} (but, this is then a keyword specific to the parquet format, and we should then probably also add it to {{scan}} et al).
> The second case is something you can in principle do yourself manually by recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., row_groups=[...])}}. However, this is a) a bit cumbersome and b) statistics might need to be parsed again?  
> The statistics of a set of filtered row groups could also be obtained by using {{split_by_row_group(filter)}} (and then get the statistics of each of the fragments), but if you then want a single fragment, you need to recreate a fragment with the obtained row group ids.
> So one idea I have now (but mostly brainstorming here). Would it be useful to have a method to create a "subsetted" ParquetFileFragment, either based on a list of row group ids ({{fragment.subset(row_groups=[...])}} or either based on a filter ({{fragment.subset(filter=...)}}, which would be equivalent as split_by_row_group+recombining into a single fragment) ?
> cc [~bkietz] [~rjzamora]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)