Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2021/03/12 15:58:00 UTC

[jira] [Resolved] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

     [ https://issues.apache.org/jira/browse/ARROW-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman resolved ARROW-8658.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9670
[https://github.com/apache/arrow/pull/9670]

> [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-8658
>                 URL: https://issues.apache.org/jira/browse/ARROW-8658
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.17.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> This is a very handy optimization for large datasets with multiple partition fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a filter {{"a"_ == 2}}, none of that directory's files or subdirectories need to be examined.
> After ARROW-8318, FileSystemDataset stores only files, so subtree pruning (whose implementation depended on the presence of directories to represent subtrees) was disabled. It should be possible to reintroduce it without reference to directories by examining partition expressions directly and extracting a tree structure from their subexpressions.
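
The idea described above can be illustrated with a minimal, standalone sketch. It is not the implementation from the pull request and does not use Arrow's API: `Fragment`, `Node`, `Filter`, `BuildTree`, `Collect`, and `GetFragments` are hypothetical names, partition expressions are assumed to be simple conjunctions of integer equalities, and the filter is assumed to be a conjunction of equalities as well. Fragments that share a prefix of partition constraints share a tree node, so a constraint that contradicts the filter is evaluated once and its whole subtree is skipped.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One equality constraint taken from a partition expression, e.g. a == 3.
using Constraint = std::pair<std::string, int>;

// Hypothetical stand-in for a dataset fragment: a file path plus the
// conjunction of equality constraints its partition expression guarantees.
struct Fragment {
  std::string path;
  std::vector<Constraint> partition;
};

// A node groups all fragments that share a prefix of constraints.
struct Node {
  std::map<Constraint, std::unique_ptr<Node>> children;
  std::vector<const Fragment*> fragments;
};

// Assumed filter shape: field name -> required value, all ANDed together.
using Filter = std::map<std::string, int>;

struct PruneResult {
  std::vector<std::string> paths;  // fragments that may satisfy the filter
  int checks = 0;                  // number of constraints evaluated
};

// Build the tree from the partition expressions alone; no directories needed.
std::unique_ptr<Node> BuildTree(const std::vector<Fragment>& fragments) {
  std::unique_ptr<Node> root(new Node());
  for (const Fragment& fragment : fragments) {
    Node* node = root.get();
    for (const Constraint& constraint : fragment.partition) {
      std::unique_ptr<Node>& child = node->children[constraint];
      if (!child) child.reset(new Node());
      node = child.get();
    }
    node->fragments.push_back(&fragment);
  }
  return root;
}

// Depth-first walk that skips any subtree whose constraint contradicts the filter.
void Collect(const Node& node, const Filter& filter, PruneResult* result) {
  assert(result != nullptr);
  for (const Fragment* fragment : node.fragments) {
    result->paths.push_back(fragment->path);
  }
  for (const auto& entry : node.children) {
    const Constraint& constraint = entry.first;
    ++result->checks;
    Filter::const_iterator required = filter.find(constraint.first);
    if (required != filter.end() && required->second != constraint.second) {
      continue;  // e.g. a == 3 under filter a == 2: prune the whole subtree
    }
    Collect(*entry.second, filter, result);
  }
}

PruneResult GetFragments(const std::vector<Fragment>& fragments,
                         const Filter& filter) {
  std::unique_ptr<Node> root = BuildTree(fragments);
  PruneResult result;
  Collect(*root, filter, &result);
  return result;
}
```

With fragments partitioned on two fields, a filter on the first field rejects each non-matching value once at the top level, rather than once per file beneath it. The saving grows with the number of files under each pruned subtree.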


