Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/07/23 12:10:00 UTC

[jira] [Assigned] (ARROW-16703) [R] Refactor map_batches() so it can stream results

     [ https://issues.apache.org/jira/browse/ARROW-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dewey Dunnington reassigned ARROW-16703:
----------------------------------------

    Assignee: Dewey Dunnington

> [R] Refactor map_batches() so it can stream results
> ---------------------------------------------------
>
>                 Key: ARROW-16703
>                 URL: https://issues.apache.org/jira/browse/ARROW-16703
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Will Jones
>            Assignee: Dewey Dunnington
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> As part of ARROW-15271, {{map_batches()}} was modified to return a {{RecordBatchReader}}, but the current implementation still collects all results into a list of record batches and only then converts that list to a reader, so nothing is actually streamed. In theory, if we push the implementation down to C++, we should be able to make a properly streaming {{RecordBatchReader}}.
> The difficulty is that we won't know the output schema ahead of time. We could optionally accept it as an argument, which would allow the function to be fully lazy, or we could eagerly evaluate just the first batch to determine the schema.
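
The two schema options described above can be sketched in a language-neutral way. The following is a minimal illustration in plain Python, not the Arrow R or C++ implementation: batches are modelled as dicts of column name to list, the "schema" is just the tuple of column names, and map_batches_streaming and schema_of are hypothetical names introduced only for this example.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-ins (assumptions, not Arrow types):
# a "batch" is a dict of column name -> values, a "schema" is the column names.
Batch = Dict[str, List]
Schema = Tuple[str, ...]


def schema_of(batch: Batch) -> Schema:
    """Derive the stand-in schema from a batch."""
    return tuple(batch.keys())


def map_batches_streaming(
    batches: Iterable[Batch],
    fn: Callable[[Batch], Batch],
    schema: Optional[Schema] = None,
) -> Tuple[Schema, Iterator[Batch]]:
    """Apply fn to each batch lazily and return (schema, iterator of results).

    If schema is supplied, no batch is evaluated until the caller iterates.
    If schema is None, only the first batch is evaluated eagerly to infer it.
    """
    mapped = (fn(b) for b in batches)  # lazy: fn runs only on demand

    if schema is None:
        sentinel = object()
        first = next(mapped, sentinel)  # eager evaluation of the first batch only
        if first is sentinel:
            raise ValueError("cannot infer a schema from an empty stream; pass schema=")
        schema = schema_of(first)
        mapped = chain([first], mapped)  # put the first result back in front

    def checked() -> Iterator[Batch]:
        # Every result must match the declared or inferred schema.
        for out in mapped:
            if schema_of(out) != schema:
                raise ValueError(f"expected columns {schema}, got {schema_of(out)}")
            yield out

    return schema, checked()
```

Passing schema keeps the whole pipeline lazy, at the cost of asking the caller to know the output columns up front. Omitting it trades one eager call to fn for convenience, and it cannot work on an empty input stream, which is why that case raises an error in this sketch.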



--
This message was sent by Atlassian Jira
(v8.20.10#820010)