Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/15 17:22:00 UTC

[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string

Nicola Crane created ARROW-15943:
------------------------------------

             Summary: [C++] Filter which files to be read in as part of filesystem, filtered using a string
                 Key: ARROW-15943
                 URL: https://issues.apache.org/jira/browse/ARROW-15943
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nicola Crane


There is a report from a user (see this Stack Overflow post [1]) who has used the {{basename_template}} parameter to write files to a dataset, some of which have the prefix {{"summary"}} and others the prefix {{"prediction"}}. This data is saved in partitioned directories. They want to read the data back in so that, in addition to filtering on the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read.

This isn't currently possible; if they try to open a dataset with a list of files, they cannot read it in as partitioned data.

A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.
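
To make the request concrete, here is a rough sketch of the behaviour being asked for, written in plain Python using only the standard library rather than any Arrow API. The directory layout, the hive-style {{key=value}} partition directories, and the helper names {{select_files}} and {{hive_partition_values}} are all assumptions made for illustration; none of them exist in Arrow.

```python
from pathlib import PurePosixPath


def select_files(paths, prefix):
    """Keep only the paths whose file name starts with `prefix`."""
    return [p for p in paths if PurePosixPath(p).name.startswith(prefix)]


def hive_partition_values(path, base_dir):
    """Parse key=value directory segments between base_dir and the file name."""
    relative = PurePosixPath(path).relative_to(base_dir)
    values = {}
    for segment in relative.parent.parts:
        key, sep, value = segment.partition("=")
        if sep:  # skip directory segments that are not key=value
            values[key] = value
    return values


# Hypothetical layout: two kinds of file share the same partition directories.
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

# Filter on the file-name prefix, but keep the partition information
# that is encoded in each file's parent directories.
summaries = select_files(paths, "summary")
partitions = [hive_partition_values(p, "data") for p in summaries]
```

The point of the sketch is that the string filter applies to the file name only, while the partition values are still derived from each file's location relative to the dataset's base directory.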

 

[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r


