Posted to issues@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2019/11/21 12:43:00 UTC

[jira] [Commented] (ARROW-7224) [Python] Partition level filters should be able to provide filtering to file systems

    [ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979242#comment-16979242 ] 

Francois Saint-Jacques commented on ARROW-7224:
-----------------------------------------------

There's some confusion between the new dataset API (in C++) and the existing ParquetDataset, which is implemented purely in Python.

> [Python] Partition level filters should be able to provide filtering to file systems
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Micah Kornfield
>            Priority: Major
>
> When providing a filter for partitions, it should be possible in some cases to use it to optimize file system list calls.  This can greatly improve the speed of reading data from partitions because fewer directories/files need to be explored/expanded.  I've fallen behind on the dataset code, but I want to make sure this issue is tracked someplace.  This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality someplace).
> Reference: [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]
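To illustrate the optimization the issue asks for, here is a minimal, hypothetical sketch (not part of PyArrow): given Hive-style partition directory names and a list of `(key, op, value)` filters in the style accepted by ParquetDataset's `filters` argument, prune the set of directories *before* recursing into them, so fewer list calls reach the file system. The `prune_partitions` function and its inputs are illustrative assumptions, not Arrow APIs.

```python
# Hypothetical sketch: prune Hive-style partition directories using
# partition-level filters, so only matching directories are listed further.
# Only '=' and '!=' are handled here; PyArrow's filter syntax is richer.

def prune_partitions(dirs, filters):
    """Keep directories whose key=value path segments satisfy every filter.

    dirs:    list of partition paths like "year=2019/month=11"
    filters: list of (key, op, value) tuples, op in {"=", "!="}
    """
    kept = []
    for d in dirs:
        # Parse segments like "year=2019" into a dict of partition keys.
        parts = dict(
            seg.split("=", 1) for seg in d.strip("/").split("/") if "=" in seg
        )
        ok = True
        for key, op, value in filters:
            if key not in parts:
                continue  # filter on a key this path doesn't encode
            if op == "=" and parts[key] != str(value):
                ok = False
            elif op == "!=" and parts[key] == str(value):
                ok = False
        if ok:
            kept.append(d)
    return kept

dirs = ["year=2018/month=01", "year=2019/month=11", "year=2019/month=12"]
print(prune_partitions(dirs, [("year", "=", 2019)]))
```

With the filter `("year", "=", 2019)`, only the two `year=2019` directories survive, so a subsequent recursive listing would skip `year=2018` entirely; on object stores like S3, avoiding those list calls is where the speedup in the linked SO question would come from.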



--
This message was sent by Atlassian Jira
(v8.3.4#803005)