Posted to issues@arrow.apache.org by "Robert Gruener (JIRA)" <ji...@apache.org> on 2018/06/27 16:12:00 UTC

[jira] [Commented] (ARROW-2656) [Python] Improve ParquetManifest creation time

    [ https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525236#comment-16525236 ] 

Robert Gruener commented on ARROW-2656:
---------------------------------------

I have opened [https://github.com/apache/arrow/pull/2185], a quick win that gives a rather large performance boost when creating a Parquet manifest for a partitioned dataset that lives in HDFS.

I think using the summary file will be a further improvement on top of this, but the thread pool change is extremely useful in the meantime.

> [Python] Improve ParquetManifest creation time 
> -----------------------------------------------
>
>                 Key: ARROW-2656
>                 URL: https://issues.apache.org/jira/browse/ARROW-2656
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Robert Gruener
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a Parquet dataset is highly partitioned, the constructor for [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588] takes a significant amount of time, since it visits directories serially to find all Parquet files. On a dataset with thousands of partition values this can take several minutes on a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool so that calls to {{_visit_level}} happen concurrently, rather than wasting a ton of time waiting on I/O serially (a minimal sketch follows at the end of this description).
> An even faster option could be to allow optional indexing of dataset metadata in something like the {{common_metadata}}. This index could contain all files in the manifest along with their row_group information, which would also allow [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746] to be implemented efficiently, without needing to open every Parquet file in the dataset to retrieve its metadata, which is quite time consuming for large datasets. The main problem with the indexing approach is that it requires immutability of the dataset, which doesn't seem too unreasonable (see the index sketch below). This implementation seems related to https://issues.apache.org/jira/browse/ARROW-1983; however, that issue only covers the write portion.
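> A minimal sketch of the ThreadPool idea (illustrative only; it walks a local filesystem, with {{os.listdir}} and {{os.path.isdir}} standing in for the target filesystem's listing calls, and is not the implementation in the linked PR):
> {code}
> from concurrent.futures import ThreadPoolExecutor
> import os
>
> def build_manifest(root, max_workers=16):
>     # Breadth-first walk: list all directories at one level in
>     # parallel, then descend into the union of their children.
>     # Overlapping the listing calls hides per-directory I/O latency,
>     # which dominates on remote filesystems such as HDFS.
>     files = []
>     level = [root]
>     with ThreadPoolExecutor(max_workers=max_workers) as pool:
>         while level:
>             next_level = []
>             for parent, names in zip(level, pool.map(os.listdir, level)):
>                 for name in names:
>                     full = os.path.join(parent, name)
>                     if os.path.isdir(full):
>                         next_level.append(full)
>                     elif name.endswith('.parquet'):
>                         files.append(full)
>             level = next_level
>     return sorted(files)
> {code}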
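> And a rough sketch of what an index stored alongside {{common_metadata}} might contain; the JSON layout here is hypothetical, not an existing pyarrow format, and it only stays valid while the dataset is immutable:
> {code}
> import json
> import pyarrow.parquet as pq
>
> def write_manifest_index(file_paths, index_path):
>     # One record per data file, carrying the row-group information
>     # that split_row_groups needs, so readers can avoid opening every
>     # footer in the dataset at read time.
>     index = []
>     for path in file_paths:
>         md = pq.read_metadata(path)  # one footer read per file, paid once at write time
>         index.append({
>             'path': path,
>             'num_rows': md.num_rows,
>             'row_group_num_rows': [md.row_group(i).num_rows
>                                    for i in range(md.num_row_groups)],
>         })
>     with open(index_path, 'w') as f:
>         json.dump(index, f)
> {code}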



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)