Posted to jira@arrow.apache.org by "Andy Douglas (Jira)" <ji...@apache.org> on 2021/02/27 11:00:00 UTC

[jira] [Commented] (ARROW-7224) [C++][Dataset] Partition level filters should be able to provide filtering to file systems

    [ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292100#comment-17292100 ] 

Andy Douglas commented on ARROW-7224:
-------------------------------------

Is there any update on this?

I'm also finding that instantiating a pyarrow dataset containing a large number of files is slow, even when passing the paths explicitly. I've tried dropping down to the parquet dataset interface and disabling schema validation, but it's still slow.
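Roughly what I'm doing is sketched below (the paths are placeholders, and the partition layout is just for illustration):

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Explicit file list (placeholder paths) -- no directory discovery should be needed
    paths = [
        "/data/table/date=2021-02-26/part-0.parquet",
        "/data/table/date=2021-02-27/part-0.parquet",
        # ...many thousands more files
    ]

    # New dataset API, with the paths supplied up front: still slow to construct
    dataset = ds.dataset(paths, format="parquet")

    # Legacy parquet interface with schema validation disabled: also slow
    legacy = pq.ParquetDataset(paths, validate_schema=False)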

In general, is there a way of caching information about partitions/files, perhaps within the metadata? I was thinking of a hierarchical setup, supported by the query language, where the query is first evaluated against the partition/file cache (if present) to determine the list of relevant files; a dataset is then instantiated by explicitly passing that list, and the query is finally evaluated against it. This could be supported outside of pyarrow, but I've struggled to find a way to evaluate the parts of the query relevant to the partitions without splitting them out into a separate query, which is clunky.
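To make the idea concrete, here is a minimal sketch of the two-step approach I have in mind (the cache format, column names and filter are invented for illustration; note how the partition part of the query has to be split out by hand, which is the clunky bit):

    import pyarrow.dataset as ds

    # Illustrative cache of partition values -> file paths, e.g. a small
    # parquet/csv file persisted alongside the dataset metadata
    partition_cache = [
        {"date": "2021-02-26", "path": "/data/table/date=2021-02-26/part-0.parquet"},
        {"date": "2021-02-27", "path": "/data/table/date=2021-02-27/part-0.parquet"},
    ]

    # Step 1: evaluate the partition-level part of the query against the cache only
    wanted_dates = {"2021-02-27"}
    paths = [row["path"] for row in partition_cache if row["date"] in wanted_dates]

    # Step 2: instantiate a dataset from the surviving files and apply the rest of the query
    dataset = ds.dataset(paths, format="parquet")
    table = dataset.to_table(filter=ds.field("value") > 0)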

> [C++][Dataset] Partition level filters should be able to provide filtering to file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases to use it to optimize file system list calls.  This can greatly improve the speed of reading data from partitions because fewer directories/files need to be explored/expanded.  I've fallen behind on the dataset code, but I want to make sure this issue is tracked someplace.  This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality someplace).
> Reference: [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)