Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/22 02:24:00 UTC

[jira] [Commented] (ARROW-15413) [C++][Datasets] Investigate sub-batch IPC reads

    [ https://issues.apache.org/jira/browse/ARROW-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480328#comment-17480328 ] 

Weston Pace commented on ARROW-15413:
-------------------------------------

[~lidavidm] I think we discussed this once before.  Can I get a quick sanity check that this should indeed be possible?

> [C++][Datasets] Investigate sub-batch IPC reads
> -----------------------------------------------
>
>                 Key: ARROW-15413
>                 URL: https://issues.apache.org/jira/browse/ARROW-15413
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> When scanning an IPC file, the finest resolution we can currently read is a record batch.  Often we are processing relatively small slices of that batch in an iterative fashion.  This means we sometimes have to read in and hold a huge batch in memory while we slice off small pieces of it.
> For example, if a user creates an IPC file with 1 record batch of 50 million rows and we want to process it in batches of 64K rows, we first have to read the entire 50 million rows into memory and then slice off the 64K sub-batches.
> We should be able to create a sub-batch reader (although this will become more complicated in the future with things like RLE columns) which can read small pieces of the batch directly off disk instead of reading the entire batch into memory first.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)