Posted to jira@arrow.apache.org by "Andy Douglas (Jira)" <ji...@apache.org> on 2021/03/08 19:27:00 UTC

[jira] [Comment Edited] (ARROW-7224) [C++][Dataset] Partition level filters should be able to provide filtering to file systems

    [ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297661#comment-17297661 ] 

Andy Douglas edited comment on ARROW-7224 at 3/8/21, 7:26 PM:
--------------------------------------------------------------

[~bkietz]  

> directories viewed by datasets are only listed once (on construction)

Right, but for a dataset containing a large number of parquet files (> 100k), construction can take a long time, and so can querying the dataset for a particular partition. What I was suggesting is the ability to load a cached copy of the mapping from partition to parquet file. Clearly this cache would be invalidated when the dataset is written to, but I have lots of datasets that are read more often than they are written, so for me the caching works well. Both the initial load and subsequent querying are much faster (seconds rather than minutes for the initial load, and then tens of seconds for the query).
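
The caching idea can be sketched roughly as follows. This is only an illustration using the Python standard library, not Arrow's dataset API: the function names (build_partition_index, save_index, load_index, files_for_partition) and the JSON cache format are assumptions made up for this example, and it assumes a hive-style layout of key=value directories.

```python
import json
import os
from collections import defaultdict


def build_partition_index(root):
    """Walk the tree once and map each partition directory (relative to
    root, '/'-separated) to the parquet file names it contains."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root).replace(os.sep, "/")
        for name in sorted(filenames):
            if name.endswith(".parquet"):
                index[rel].append(name)
    return dict(index)


def save_index(index, cache_path):
    """Persist the mapping so later loads can skip the directory walk."""
    with open(cache_path, "w") as f:
        json.dump(index, f)


def load_index(cache_path):
    """Load a previously saved mapping (hypothetical JSON cache file)."""
    with open(cache_path) as f:
        return json.load(f)


def files_for_partition(index, root, **wanted):
    """Return full paths of files whose partition directory contains
    every requested key=value segment (values compared as strings)."""
    segments = {f"{key}={value}" for key, value in wanted.items()}
    matches = []
    for partition, names in index.items():
        if segments <= set(partition.split("/")):
            matches.extend(os.path.join(root, partition, n) for n in names)
    return sorted(matches)
```

The cache has to be rebuilt (or discarded) whenever the dataset is written to; nothing in this sketch detects a stale cache.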


was (Author: andydoug):
[~bkietz]  

> directories viewed by datasets are only listed once (on construction)

right but for a dataset containing a large number of parquet files (> 100k) the construction can take a long time, and so can querying the dataset for a particular partition. What I was suggesting is the ability to load a cached copy of the dataset *files* as a dataset i.e. the mapping from partition to parquet file. Clearly this cache would be invalidated when the dataset is written to but I have lots of datasets that are read more than they are written, so for me the caching works well. Both the initial load and subsequent querying are much faster (seconds not minutes for the initial load and then tens of seconds for the query)

> [C++][Dataset] Partition level filters should be able to provide filtering to file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases to use it to optimize file system list calls.  This can greatly improve the speed of reading data from partitions because fewer directories/files need to be explored/expanded.  I've fallen behind on the dataset code, but I want to make sure this issue is tracked someplace.  This came up in the SO question linked below (feel free to correct my analysis if I missed the functionality someplace).
> Reference: [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]
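
The optimization described in the issue can be illustrated with a small sketch. It is a standard-library-only example against a local directory tree, not the Arrow C++ implementation: the function name list_pruned and the dict-based equality filter are assumptions made for this example, and it assumes hive-style key=value directory names.

```python
import os


def list_pruned(root, filters):
    """List parquet files under a hive-style tree, descending only into
    key=value directories consistent with `filters`.

    `filters` maps a partition key to the required value (as a string).
    Only equality filters are handled in this sketch.
    """
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir():
                    key, sep, value = entry.name.partition("=")
                    # Skip a directory only when it names a filtered key
                    # with a different value; it is never listed at all.
                    if sep and key in filters and filters[key] != value:
                        continue
                    stack.append(entry.path)
                elif entry.name.endswith(".parquet"):
                    found.append(entry.path)
    return sorted(found)
```

Because non-matching directories are skipped before they are opened, the number of list calls scales with the matching partitions rather than with the whole tree, which matters most on object stores where each list call is a network round trip.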



--
This message was sent by Atlassian Jira
(v8.3.4#803005)