Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/09/23 19:24:00 UTC

[jira] [Resolved] (ARROW-14024) [C++] ScanOptions::batch_size not respected in parquet/IPC readers

     [ https://issues.apache.org/jira/browse/ARROW-14024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li resolved ARROW-14024.
------------------------------
    Resolution: Fixed

Issue resolved by pull request 11207
[https://github.com/apache/arrow/pull/11207]

> [C++] ScanOptions::batch_size not respected in parquet/IPC readers
> ------------------------------------------------------------------
>
>                 Key: ARROW-14024
>                 URL: https://issues.apache.org/jira/browse/ARROW-14024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 6.0.0
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> At first glance, it seems like the Parquet reader should work: the ScanOptions::batch_size property is forwarded into the ArrowReaderProperties for the parquet::arrow::FileReader.  However, we then use ReadOneRowGroup, which doesn't look at the batch_size option.
> The IPC reader doesn't look at the property at all.
> Even if we can't control the source read size (e.g. we have to read a full row group or record batch and have no control over its size), we can still split whatever we read into smaller batches that respect the batch size.  This is important for achieving parallelism, as we can then partition the CPU work across these batches.
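
The splitting idea described above can be sketched without any Arrow dependency. The helper below is hypothetical (SliceRanges is not an Arrow function and is not taken from the pull request); it only computes the (offset, length) pairs that cut an already-read batch of num_rows rows into pieces of at most batch_size rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Arrow: returns the (offset, length) pairs
// that split `num_rows` rows into consecutive chunks of at most `batch_size`
// rows. The last chunk may be shorter than `batch_size`.
std::vector<std::pair<int64_t, int64_t>> SliceRanges(int64_t num_rows,
                                                     int64_t batch_size) {
  assert(batch_size > 0);  // a non-positive batch size would never advance
  std::vector<std::pair<int64_t, int64_t>> ranges;
  for (int64_t offset = 0; offset < num_rows; offset += batch_size) {
    // Clamp the final chunk to the rows that remain.
    ranges.emplace_back(offset, std::min(batch_size, num_rows - offset));
  }
  return ranges;
}
```

In Arrow C++, each pair could then be passed to arrow::RecordBatch::Slice(offset, length), which returns a zero-copy view of the original batch, so the resulting smaller batches can be handed to separate CPU tasks. How the actual fix performs the split is not stated in this message.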



--
This message was sent by Atlassian Jira
(v8.3.4#803005)