Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/09/01 17:12:00 UTC

[jira] [Commented] (ARROW-17593) [C++] Try and maintain input shape in Acero

    [ https://issues.apache.org/jira/browse/ARROW-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599079#comment-17599079 ] 

Will Jones commented on ARROW-17593:
------------------------------------

I've been reading through the Parquet implementation, and was surprised to find that you cannot write out a row group with multiple batches. We've decoupled row group sizes and batch size on read (great!), but not on write. Perhaps that should also be part of the solution.

I'm not deeply familiar with Acero internals yet, but what you've described here seems very sensible, though it sounds like we may need some helper class to allocate the batch and line up the morsels, IIUC.

> [C++] Try and maintain input shape in Acero
> -------------------------------------------
>
>                 Key: ARROW-17593
>                 URL: https://issues.apache.org/jira/browse/ARROW-17593
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> Data is scanned in large chunks based on the format.  For example, CSV scans chunks based on a chunk_size while Parquet scans entire row groups.
> Then, upon entry into Acero, these chunks are sliced into morsels (~L3 size) for parallelism and batches (~L1-L2 size) for cache-efficient processing.
> However, the way it is currently done means that the output of Acero is a stream of tiny batches.  This is undesirable in many cases.
> For example, if a pyarrow user calls pq.read_table, they might expect to get one batch per row group.  If they were to turn around and write out that table to a new Parquet file, then either they end up with a non-ideal Parquet file (tiny row groups) or they are forced to concatenate the batches (which is an allocation + copy).
> Even if the user is doing their own streaming processing (e.g. in pyarrow), these small batch sizes are undesirable, as the overhead of Python means that streaming processing should be done in larger batches.
> Instead, there should be a max_batch_size, independent of row group size and morsel size, which is configurable and quite large by default (1Mi or 64Mi rows).  This control exists for users that want to do their own streaming processing and need to be able to tune for RAM usage.
> Acero will read in data based on the format, as it does today (e.g. CSV chunk size, row group size).  If the source data is very large (bigger than max_batch_size), it will be sliced.  From that point on, any morsels or batches should simply be views into this larger output batch.  For example, when doing a projection to add a new column, we should allocate a max_batch_size array and then populate it over many runs of the project node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)