Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/03/02 21:02:00 UTC

[jira] [Updated] (ARROW-15410) [C++][Datasets] Improve memory usage of datasets API when scanning parquet

     [ https://issues.apache.org/jira/browse/ARROW-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-15410:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Datasets] Improve memory usage of datasets API when scanning parquet
> --------------------------------------------------------------------------
>
>                 Key: ARROW-15410
>                 URL: https://issues.apache.org/jira/browse/ARROW-15410
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a targeted fix to improve memory usage when scanning parquet files.  It is related to broader issues like ARROW-14648, but those will likely take longer to fix.  The goal here is to make it possible to scan large parquet datasets with many files, where each file has reasonably sized row groups (e.g. 1 million rows).  Currently we run out of memory scanning a configuration as simple as:
> - 21 parquet files
> - Each parquet file has 10 million rows, split into row groups of 1 million rows



--
This message was sent by Atlassian Jira
(v8.20.1#820001)