Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/04/27 21:04:00 UTC

[jira] [Resolved] (ARROW-16294) [C++] Improve performance of parquet readahead

     [ https://issues.apache.org/jira/browse/ARROW-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li resolved ARROW-16294.
------------------------------
    Fix Version/s: 9.0.0
       Resolution: Fixed

Issue resolved by pull request 12967
[https://github.com/apache/arrow/pull/12967]

> [C++] Improve performance of parquet readahead
> ----------------------------------------------
>
>                 Key: ARROW-16294
>                 URL: https://issues.apache.org/jira/browse/ARROW-16294
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In 7.0.0, Parquet readahead would read up to 256 row groups at once, which meant that, if the consumer was too slow, we would almost certainly run out of memory.
> ARROW-15410 improved readahead as a whole and, in the process, changed the Parquet reader so that it always reads 1 row group in advance.
> This is not always ideal in S3 scenarios: when the row groups are small, we may want to read many of them in advance. To fix this, we should continue reading row groups in parallel until at least batch_size * batch_readahead rows are being fetched.
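
The stopping rule proposed above can be illustrated with a minimal Python sketch. This is not the Arrow C++ implementation from the pull request; the function name `row_groups_to_prefetch` and its parameters are hypothetical and only model the arithmetic: keep starting row-group reads until the number of rows in flight reaches batch_size * batch_readahead.

```python
def row_groups_to_prefetch(row_group_sizes, rows_in_flight, batch_size, batch_readahead):
    """Return how many upcoming row groups to start fetching.

    Hypothetical helper for illustration only; it is not part of Arrow.

    row_group_sizes: row counts of the row groups not yet requested, in order.
    rows_in_flight:  rows already being fetched but not yet consumed.
    """
    # Target from the issue description: at least this many rows in flight.
    target_rows = batch_size * batch_readahead
    started = 0
    for num_rows in row_group_sizes:
        if rows_in_flight >= target_rows:
            break  # enough rows are already being fetched
        rows_in_flight += num_rows
        started += 1
    return started
```

Under this rule, small row groups lead to many concurrent reads, which suits high-latency stores such as S3, while a single large row group already meets the target and so bounds memory use.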



--
This message was sent by Atlassian Jira
(v8.20.7#820007)