Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/10/25 16:44:00 UTC

[jira] [Created] (ARROW-18160) [C++] Scanner slicing large row groups leads to inefficient RAM usage

Weston Pace created ARROW-18160:
-----------------------------------

             Summary: [C++] Scanner slicing large row groups leads to inefficient RAM usage
                 Key: ARROW-18160
                 URL: https://issues.apache.org/jira/browse/ARROW-18160
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


As an example, consider a 4GB parquet file with one giant row group.  At the moment we inevitably read this in as a single 4GB record batch (there are other JIRAs for sub-row-group reads which, if implemented, would obsolete this one).

We then slice off pieces of that 4GB record batch for processing:

{noformat}
next_batch = current.slice(0, batch_size)
current = current.slice(batch_size)
{noformat}

However, even though {{current}} is shrinking each time, it always references the entire 4GB allocation (slicing is zero-copy, so the parent buffer cannot be freed while any slice is alive).  We may want to investigate alternative strategies here so that we can free up memory once we are done processing a piece.
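The same zero-copy pinning can be sketched outside Arrow with a plain Python {{memoryview}} (this is an illustrative analogy, not Arrow code; the buffer sizes are made up):

{noformat}
# Slicing a memoryview, like slicing an Arrow record batch, is
# zero-copy: every slice keeps a reference to the whole parent buffer,
# so the 4MB stand-in below stays allocated even as `current` shrinks.
data = bytearray(4 * 1024 * 1024)   # stand-in for one large row group
current = memoryview(data)
batch_size = 1024

next_batch = current[:batch_size]
current = current[batch_size:]

# Even a tiny slice pins the entire parent allocation:
assert next_batch.obj is data
assert current.obj is data

# Copying a slice (e.g. bytes(next_batch)) detaches it from the parent
# at the cost of a memcpy -- one possible alternative strategy.
{noformat}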



--
This message was sent by Atlassian Jira
(v8.20.10#820010)